A nonparametric Bayesian approach to the rare type match problem

Giulia Cereda; Richard D. Gill

arXiv:1908.02954·stat.AP·May 30, 2022·Entropy

A nonparametric Bayesian approach to the rare type match problem

Giulia Cereda, Richard D. Gill

PDF

TL;DR

This paper introduces a Bayesian nonparametric method using a Poisson Dirichlet prior to address the rare type match problem in DNA analysis, simplifying likelihood ratio calculations when population proportions are unknown.

Contribution

It presents a novel nonparametric Bayesian approach that models population proportions without relying on profile labels, improving analysis of rare DNA matches.

Findings

01

Effective modeling of population proportions with Poisson Dirichlet distribution

02

Simplified likelihood ratio computation via Empirical Bayes

03

Well-suited for European Y-STR DNA profile data

Abstract

The "rare type match problem" is the situation in which the suspect's DNA profile, matching the DNA profile of the crime stain, is not in the database of reference. The evaluation of this match in the light of the two competing hypotheses (the crime stain has been left by the suspect or by another person) is based on the calculation of the likelihood ratio and depends on the population proportions of the DNA profiles, that are unknown. We propose a Bayesian nonparametric method that uses a two-parameter Poisson Dirichlet distribution as a prior over the ranked population proportions, and discards the information about the names of the different DNA profiles. This fits very well the data coming from European Y-STR DNA profiles, and the calculation of the likelihood ratio becomes quite simple thanks to a justified Empirical Bayes approach.

Tables2

Table 1. Table 1: Summaries of the distribution of log 10 ⁡ LR subscript 10 LR \log_{10}\mathrm{LR} , log 10 ⁡ ( LR | 𝐩 ) \log_{10}(\mathrm{LR}_{|\mathbf{p}}) , and log 10 ⁡ LR f subscript 10 subscript LR 𝑓 \log_{10}\mathrm{LR}_{f} .

	Min	1st Qu.	Median	Mean	3rd Qu.	Max	sd
$\log_{10} LR$	2.417	2.512	2.572	2.59	2.658	2.879	0.102
$\log_{10} {LR}_{\| 𝐩}$	2.392	2.507	2.542	2.529	2.56	2.602	0.045
$\log_{10} {LR}_{f}$	1.646	2.464	2.832	2.803	3.309	3.309	0.463

Table 2. Table 2: Summaries of the distribution of Diff 1 subscript Diff 1 \mathrm{Diff}_{1} , Diff 2 subscript Diff 2 \mathrm{Diff}_{2} , and Diff 3 subscript Diff 3 \mathrm{Diff}_{3} .

	Min	1st Qu.	Median	Mean	3rd Qu.	Max	sd
Diff₁	-0.146	-0.036	0.044	0.06	0.144	0.381	0.126
Diff₂	-0.891	-0.676	-0.221	-0.213	0.115	0.94	0.472

Equations107

LR = \frac{Pr ( D ∣ H _{p} )}{Pr ( D ∣ H _{d} )} .

LR = \frac{Pr ( D ∣ H _{p} )}{Pr ( D ∣ H _{d} )} .

LR = \frac{Pr ( D = d ∣ H = h _{p} )}{Pr ( D = d ∣ H = h _{d} )} or \frac{p ( d ∣ h _{p} )}{p ( d ∣ h _{d} )} .

LR = \frac{Pr ( D = d ∣ H = h _{p} )}{Pr ( D = d ∣ H = h _{d} )} or \frac{p ( d ∣ h _{p} )}{p ( d ∣ h _{d} )} .

0 \leq α < 1, and θ > - α .

0 \leq α < 1, and θ > - α .

\forall i W_{i} = V_{i} j = 1 \prod i - 1 (1 - V_{j}),

\forall i W_{i} = V_{i} j = 1 \prod i - 1 (1 - V_{j}),

V_{i} \sim Beta (1 - α, θ + i α) .

V_{i} \sim Beta (1 - α, θ + i α) .

Pr (H = h) \propto 1 for h \in {h_{p}, h_{d}} .

Pr (H = h) \propto 1 for h \in {h_{p}, h_{d}} .

(A, Θ) \sim p (α, θ) .

(A, Θ) \sim p (α, θ) .

P ∣ A = α, Θ = θ \sim P D (α, θ) .

P ∣ A = α, Θ = θ \sim P D (α, θ) .

X_{1}, X_{2}, ..., X_{n} ∣ P = p \sim_{i . i . d .} p .

X_{1}, X_{2}, ..., X_{n} ∣ P = p \sim_{i . i . d .} p .

X_{n + 1} ∣ P = p \sim p .

X_{n + 1} ∣ P = p \sim p .

X_{n + 2} ∣ P = p, X_{n + 1} = x_{n + 1}, H = h \sim {δ_{x_{n + 1}} p if h = h_{p} if h = h_{d} .

X_{n + 2} ∣ P = p, X_{n + 1} = x_{n + 1}, H = h \sim {δ_{x_{n + 1}} p if h = h_{p} if h = h_{d} .

N^{10}

N^{10}

X_{1}, ..., X_{10}

(2, 4, 2, 4, 3, 3, 10, 13, 5, 4)

Π_{[n]}^{Db}

Π_{[n]}^{Db}

Π_{[n + 1]}^{Db +}

Π_{[n + 2]}^{Db ++}

B = (b_{1}, b_{2}, b_{1}, b_{2}, b_{3}, b_{3}, b_{4}, b_{5}, b_{6}, b_{2}),

B = (b_{1}, b_{2}, b_{1}, b_{2}, b_{3}, b_{3}, b_{4}, b_{5}, b_{6}, b_{2}),

π_{[10]}^{Db} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}} .

π_{[10]}^{Db} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}} .

π_{[11]}^{Db +} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}, {11}},

π_{[11]}^{Db +} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}, {11}},

π_{[12]}^{Db ++} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}, {11, 12}} .

π_{[12]}^{Db ++} = {{1, 3}, {2, 4, 10}, {5, 6}, {7}, {8}, {9}, {11, 12}} .

P ∣ α, θ \sim P D (α, θ),

P ∣ α, θ \sim P D (α, θ),

X_{1}, X_{2}, ... ∣ P = p \sim_{i.i.d} p,

X_{1}, X_{2}, ... ∣ P = p \sim_{i.i.d} p,

P_{n}^{α, θ} (π_{[n]}) := Pr (Π_{[n]} = π_{[n]} ∣ α, θ) = \frac{[ θ + α ] _{k - 1; α}}{[ θ + 1 ] _{n - 1; 1}} i = 1 \prod k [1 - α]_{n_{i} - 1; 1},

P_{n}^{α, θ} (π_{[n]}) := Pr (Π_{[n]} = π_{[n]} ∣ α, θ) = \frac{[ θ + α ] _{k - 1; α}}{[ θ + 1 ] _{n - 1; 1}} i = 1 \prod k [1 - α]_{n_{i} - 1; 1},

Y_{1} = 1,

Y_{1} = 1,

Pr (Y_{n + 1} = i ∣ Y_{1}, ..., Y_{n}) = ⎩ ⎨ ⎧ \frac{θ + k α}{n + θ} \frac{n _{i} - α}{n + θ} if i = k + 1 if 1 \leq i \leq k

Pr (Y_{n + 1} = i ∣ Y_{1}, ..., Y_{n}) = ⎩ ⎨ ⎧ \frac{θ + k α}{n + θ} \frac{n _{i} - α}{n + θ} if i = k + 1 if 1 \leq i \leq k

p (π_{[n + 2]}^{Db++} ∣ h_{p}, π_{[n + 1]}^{Db+}, α, θ) = 1,

p (π_{[n + 2]}^{Db++} ∣ h_{p}, π_{[n + 1]}^{Db+}, α, θ) = 1,

p (π_{[n + 2]}^{Db++} ∣ h_{d}, π_{[n + 1]}^{Db+}, α, θ) = \frac{1 - α}{n + 1 + θ},

p (π_{[n + 2]}^{Db++} ∣ h_{d}, π_{[n + 1]}^{Db+}, α, θ) = \frac{1 - α}{n + 1 + θ},

lik (h ∣ x, y) \propto E (p (y ∣ x, Z, h) ∣ X = x) .

lik (h ∣ x, y) \propto E (p (y ∣ x, Z, h) ∣ X = x) .

p (z, h, x, y) = p (z) p (x ∣ z) p (h) p (y ∣ x, z, h) .

p (z, h, x, y) = p (z) p (x ∣ z) p (h) p (y ∣ x, z, h) .

p (z, h, y ∣ x) = p (z ∣ x) p (h) p (y ∣ x, z, h) .

p (z, h, y ∣ x) = p (z ∣ x) p (h) p (y ∣ x, z, h) .

p (h, y ∣ x) = p (h) E (p (y ∣ x, Z, h) ∣ X = x) .

p (h, y ∣ x) = p (h) E (p (y ∣ x, Z, h) ∣ X = x) .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

A nonparametric Bayesian approach to the rare type match problem

Giulia Cereda,

Leiden University, Mathematical Institute

Richard Gill,

Leiden University, Mathematical Institute

Combray Causality Consulting

Abstract

The “rare type match problem” is the situation in which, in a criminal case, the suspect’s DNA profile, matching the DNA profile of the crime stain, is not in the database of reference. Ideally, the evaluation of this observed match in the light of the two competing hypotheses (the crime stain has been left by the suspect or by another person) should be based on the calculation of the likelihood ratio and depends on the population proportions of the DNA profiles, that are unknown. We propose a Bayesian nonparametric method that uses a two-parameter Poisson Dirichlet distribution as a prior over the ranked population proportions, and discards the information about the names of the different DNA profiles. This model is validated using data coming from European Y-STR DNA profiles, and the calculation of the likelihood ratio becomes quite simple thanks to an Empirical Bayes approach for which we provided a motivation.

Forensic statistics, likelihood ratio, rare type match, Bayesian nonparametric

1 Introduction

The largely accepted method for evaluating how much some available data $\mathcal{D}$ (typically forensic evidence) helps discriminate between two hypotheses of interest (the prosecution hypothesis $H_{p}$ and the defense hypothesis $H_{d}$ ), is the calculation of the likelihood ratio (LR), a statistic that expresses the relative plausibility of the data under these hypotheses, defined as

[TABLE]

Widely considered the most appropriate framework to report a measure of the ‘probative value’ of the evidence regarding the two hypotheses (Robertson and Vignaux, 1995; Evett and Weir, 1998; Aitken and Taroni, 2004; Balding, 2005), it indicates the extent to which observed data support one hypothesis over the other. The likelihood ratio is supposed to be multiplied to the prior odds, in order to obtain the posterior odds. The latter is the quantity of interest for a judge, but the prior odds do not fall within the statistician competence. Even if a judge does not explicitly do the Bayesian updating, the likelihood ratio is still considered to be the correct way for the expert to communicate their evaluation of the weight of the evidence to the court. We refer the reader to Taroni et al. (2006) for an extensive dissertation on the use of likelihood ratios in forensic statistics. Forensic literature presents many approaches to calculate the LR, mostly divided into Bayesian and frequentist methods (see Cereda (2017a, b) for a careful differentiation between these two approaches).

This paper proposes a first application of a Bayesian nonparametric method to assess the likelihood ratio in the rare type match case, the challenging situation in which there is a match between some characteristic of the recovered material and of the control material, but this characteristic has not been observed before in previously collected samples (i.e. in the database of reference). This constitutes a problem because the value of the likelihood ratio depends on the unknown proportion of the matching characteristic in a reference population, and the uncertainty over this proportion, in standard practice for simpler situations, is dealt with using the relative frequency of the characteristic in the available database. In particular, we will focus on Y-STR data, for which the rare type match problem keeps turning up (Cereda, 2017b). The problem is so substantial that it has been called “the fundamental problem of forensic mathematics” (Brenner, 2010).

The use of our Bayesian nonparametric method involves the mathematical assumption that there are infinitely many different Y-STR profiles. Of course, we do not believe this literally to be true. We do suppose that there are so many profiles that we cannot say anything sensible about their exact number, except that it is very large. Hence, we pretend they are infinitely many, so that we can use the chosen Bayesian nonparametric method. The parameter of the model is the infinite-dimensional vector $\mathbf{p}$ , containing the (unknown) sorted population proportions of all possible Y-STR profiles. As a prior over $\mathbf{p}$ we choose the two-parameter Poisson Dirichlet distribution, and we model the uncertainty over its parameters $\alpha$ and $\theta$ through the use of a hyperprior. The information contained in the actual numbers, a list of which form the name of each Y-STR profiles is discarded, thereby reducing the full data $\mathcal{D}$ to a smaller set $D$ .

If compared to traditional Bayesian methods such as those discussed in Cereda (2017a), this method has the advantage of having a prior for the parameter p that is more realistic for the population we intend to model. Moreover, despite its technical theoretical background, we empirically derived an approximation that makes the method intuitive and simple to apply for practical use: indeed, simulation experiments show that a hybrid empirical approach that plugs in maximum likelihood estimators for the hyperparamenter is justified, at least when using populations that look like the complete Y-STR data from European populations. The last point in favour of the choice of the two-parameter Poisson Dirichlet prior over p is that it has the following sufficiency property: the probability of observing a new Y-STR profile only depends on the number of already observed Y-STR profiles and on the sample size, while the probability of observing a Y-STR profile that is already in the database only depends on its frequency in the database and on the sample size.

The paper is structured as follows: Section 2 discusses the state of the art regarding the rare type match problem and the evaluation of Y-STR matches. Section 3 presents our model, with the assumptions and the prior distribution chosen for the parameter $\mathbf{p}$ along with some theory on random partitions and the Chinese restaurant representation, useful to provide a prediction rule and a convenient and compact representation of the reduced data $D$ . Also, a lemma that facilitates computing the likelihood ratio in a very simple way is presented and proved. In Section 4, the likelihood ratio is derived. Section 5 illustrates the application of this model to a database sampled from an artificial population. We will discuss data-driven choices for the hyperparameters, and the derivation of the likelihood ratio values obtained both with and without reducing the data to partitions, in the ideal situation in which vector $\mathbf{p}$ is known. Also, the distribution of the likelihood ratios for different rare type match cases is analysed, along with the analysis of two different errors.

2 State of the art

Y-STR data have been our main motivation for studying the rare type match problem. Our model will reduce the relationship among various profiles to a binary “match” or “no match” equivalence relation. However, there is a big debate in the scientific community regarding whether it is acceptable to throw away the genetic structure of this kind of data. In this section we discuss the state of the art regarding the rare type match problem as a general issue and also the state of the art regarding methods to assess evidential values of matching Y-STR profiles. Indeed, the rare type match problem is an interesting problem also outside the Y-STR profile setting and our model can be applied also to other kinds of data.

2.1 The rare type match problem

The evaluation of a match between the profile of a particular piece of evidence and a suspect’s profile depends on the relative frequencies of that profile in the population of potential perpetrators. Indeed, it is intuitive that the rarer the matching profile, the more the suspect is in trouble. Problems arise when the observed frequency of the profile in a sample (database) from the population of interest is 0. This problem can be named as “the new type match problem”, but we decided to use the name “rare type match problem”, motivated by the fact that a profile that has zero occurrences is likely to be rare, even though it is challenging to quantify how rare it is. The rare type match problem is particularly important for new kinds of forensic evidence, such as results from DIP-STR markers (see for instance Cereda et al. (2014)) for which the available database size is still limited. The problem also occurs when more established types of evidence, such as Y-chromosome (or mitochondrial) DNA profiles are used, as explained in Section 2.2, and they have been our main motivation for the present study.

The rare type match problem has been addressed in well known non-forensic statistics domains, and many solutions have been proposed. The empirical frequency estimator, also called naive estimator, that uses the frequency of the characteristic in the database, puts unit probability mass on the set of already observed characteristics, and it is thus unprepared for the observation of a new type. A solution could be the add-constant estimators (in particular the well-known add-one estimator, due to Laplace (1814), and the add-half estimator of Krichevsky and Trofimov (1981)), which add a constant to the count of each type, included the unseen ones. However, these methods require knowledge of the number of possible unseen types, and they perform badly when this number is large compared to the sample size (see Gale and Church (1994) for an additional discussion). Alternatively, Good (1953), based on an intuition on A.M. Turing, proposed the Good-Turing estimator for the total unobserved probability mass, based on the proportion of singleton observations in the sample. An application of this estimator to the frequentist LR assessment in the rare type match case is proposed in Cereda (2017b).

Recently, Orlitsky et al. (2004) have introduced the high-profile estimator, which extends the tail of the naive estimator to the region of unobserved types. Anevski et al. (2017) improved this estimator and provided a consistency proof for their modified estimator, while original authors only provided heuristic reasoning that turned out to be rather difficult to make rigorous.

Bayesian nonparametric estimators for the probability of observing a new type have been proposed by Tiwari and Tripathi (1989) using Dirichlet processes, by Lijoi et al. (2007); De Blasi et al. (2015) using a general Gibbs prior, and by Favaro et al. (2009) with specific focus on the two-parameter Poisson Dirichlet prior, for which Arbel et al. (2017) provides large sample asymptotic and credible bands. In particular, Favaro et al. (2016) shows the link between the Bayesian nonparametric approach and the Good-Turing estimator. However, the LR assessment requires not only the probability of observing a new species but also the probability of observing this same species twice (according to the defence, the crime stain profile and the suspect profile are two independent observations): to our knowledge, the present paper is the first to address the problem of assessment of the LR in the rare type match case using a Bayesian nonparametric model. As a prior for $\mathbf{p}$ we will use the two-parameter Poisson Dirichlet distribution, which is proving useful in many discrete domains, in particular language modeling (Teh et al., 2006). Besides, it predicts a power-law behavior that describes an incredible variety of phenomena (Newman, 2005), among which the distribution of Y-STR haplotypes, too.

2.2 Evaluation of matching probabilities of Y-STR data

Y-STR profiles are typically used to detect male DNA in male-female DNA mixtures and are made of a number (usually varying from 7 to 23) of STR polymorphisms belonging to the non-recombining part of the Y-chromosome. Hence, there is no biological reason to assume independence among Y-STR loci. Even though the lack of recombination is in principle balanced by recurrent and backward mutations, the existence of such a dependency is studied and confirmed by Caliebe et al. (2015). For what concerns Y-STR population frequencies, the dependency between loci implies that no factorization (of the kind used for the autosomal markers) can be used to calculate these frequencies, and that the available databases are too small with respect to the large space of possible profiles, hence containing a high proportion of singletons. Indeed, the rare type match case is very frequent when using Y-STR data, and the use of simplistic methods such as the profile count is too conservative for practical use (it is bounded from below by the inverse of the database size) (Caliebe et al., 2015). In Andersen et al. (2018) and Andersen et al. (2019), approximations of the joint distribution with second and third order dependencies between loci are explored. However, as admitted by the authors, there is a limitation due to the inadequacy of the sizes of available databases that makes it necessary to use simulations, that in turns are oversimplification of real data.

Moreover, as highlighted already in 1994 by Balding and Nichols (1994) match probabilities cannot be identified with population frequencies since a match can be due also to a certain degree of relatedness between the two donors of the stain. This is particularly true for Y-STR data, since Y-STR profiles are inherited almost identical from father to son. More recently, Andersen and Balding (2017) investigate the influence of relatedness on matches and make a study concluding that 95 $\%$ of matching profiles are separated by a relatively small number (50-100) of meiosis, hence the degree of relatedness is a very influential factor, according to their study. They thus propose a method to describe the distribution of the number of males with a matching Y-STR profile, extending the approach to mixtures in Andersen and Balding (2019). One limitation of this study is that it is based on extensive simulations which have to be performed anew in each new application, on assumptions about genetic evolutionary model, and on parameters hard to be validated.

There is a huge number of methods developed to assess the evidential values for Y-STR data. Among those that are developed precisely for the rare type match case there are Egeland and Salas (2008), Brenner (2010), Cereda (2017b) and Cereda (2017a). All these methods do not take into account genetic information contained in the allelic numbers forming a Y-STR DNA profile. For instance, due to relatedness, the observation of a particular Y-STR profile increases the probability of observing the same Y-STR profile again or Y-STR profiles that differ only for few alleles. We refer the reader to Roewer (2009); Buckleton et al. (2011); Willuweit et al. (2011); Wilson et al. (2003) for models that use population genetics for coancestry. These models are not designed to be used for the rare type match case, but the Discrete Laplace method presented in Andersen et al. (2013) can be successfully applied to that purpose, as shown in Cereda (2017b).

After a careful study of the available methods for assessing likelihood ratios (or matching probabilities) for Y-STR matches, one can see that they are of different nature (some of them do their best to exploit the genetic structure, others don’t) and based on different assumptions. In our opinion none of them is fully satisfactory and at the same time useful for the rare type match and for general cases. As far as we are concerned, we decided to asses the probability of the reduction of the data taking into account only the equalities and inequalities among profiles rather than considering the specific Y-STR observed characteristics. We know part of the scientific community will not agree with our approach, precisely because of the results shown in Andersen and Balding (2017), but we believe in the accuracy of our method. Moreover, even though Y-STR data have been the main motivation for this study, this model is actually applicable to different kinds of data (in principle for all forensic data that shows power law behaviour). When applied to data without genetic structures (such as tire marks or glass fragments), these kind of criticisms should fade away.

The Y-STR marker system will thus be employed here as an extreme but in practice common and important way in which the problem of assessing the evidential value of rare type match can arise, but we don’t pretend to solve the problem entirely. We believe that the analyst should perform several analyses using different models and different assumptions, and compare the performance of the different methods, in order to try to learn from the differences (or lack of differences) between the conclusions which would follow from each method individually.

The big issues of working with Y-STR data is the unavailability of reliable databases, which are representative of actual population. The YHRD database is in fact a collection of databases coming from police or laboratories. We are well aware of this limitation.

3 The model

3.1 Notation and data

Throughout the paper the following notation is chosen: random variables and their values are denoted, respectively, with uppercase and lowercase characters: $x$ is a realization of $X$ . Random vectors and their values are denoted, respectively, by uppercase and lowercase bold characters: $\mathbf{p}$ is a realization of the random vector $\mathbf{P}$ . Probability is denoted with $\Pr(\cdot)$ , while the density of a continuous random variable $X$ is denoted alternatively by $p_{X}(x)$ or by $p(x)$ when the subscript is clear from the context. For a discrete random variable $Y$ , the density notation $p_{Y}(y)$ and the discrete one $\Pr(Y=y)$ will be interchangeably used. Moreover, we will use shorthand notation like $p(y\mid x)$ to stand for the probability density of Y with respect to the conditional distribution of $Y$ given $X=x$ .

Notice that in Formula (1), $\mathcal{D}$ was regarded as the event corresponding to the observation of the available data. However, later in the paper, $\mathcal{D}$ will be regarded as a random variable generically representing the data. The particular data at hand will correspond to the value $d$ . In that case, the following notation will thus be preferred:

[TABLE]

Lastly, notice that “DNA types” is used throughout the paper as a general term to indicate Y-STR profiles.

The data used in the present study were obtained from the Y Chromosome Haplotype Reference Database (YHRD) (Willuweit and Roewer, 2007; Purps et al., 2014). Here, only 7 of the markers included in the PowerPlex1Y23 system (PPY23, Promega Corporation, Madison, WI) were investigated : DYS19, DYS389I, DYS389II.I, DYS390, DYS391, DYS392, DYS393. The dependence between these 7 “core markers” is studied in Caliebe et al. (2015) that concludes that “each of these seven markers contribute indispensable information about each other markers from the same set”.

3.2 Model assumptions

Our model is based on the two following assumptions:

Assumption 1

There are infinitely many different DNA types in Nature.

This assumption, already used by e.g. Kimura (1964) in the ‘infinite alleles model’, allows the use of Bayesian nonparametric methods and is very useful for instance in ‘species sampling problems’ when the total number of possible different species in Nature cannot be specified. This assumption is sensible also in case of Y-STR DNA profiles since the state space of possible different haplotypes is so large that it can be considered infinite.

Assumption 2

The names of the different DNA types do not contain relevant information.

Actually, the specific sequence of numbers that forms a DNA profile carries information: if two profiles show few differences this means that they are separated by few mutation drifts, hence the profiles share a relatively recent common ancestor. However, this information can be very difficult to use and it might be wiser not to try to use it in the LR assessment. This is the reason why we will treat DNA types as “colours”, and only consider the partition into different categories. Stated otherwise, we put no topological structure on the space of the DNA types.

Notice that this assumption makes the model a priori suitable for any characteristic which has many different possible types showing power law behaviour, thus the approach described in this paper still holds, in principle, after replacing ‘DNA types’ with any other category.

3.3 Prior

In Bayesian statistics, parameters of interest are modeled through random variables. The (prior) distribution over a parameter should represent the prior uncertainty about its value.

The assessment of the LR for the rare type match involves two unknown parameters of interest: one is $h\in\{h_{p},h_{d}\}$ , representing the unknown true hypothesis, the other is $\mathbf{p}$ , the vector of the unknown population frequencies of all DNA profiles in the population of potential perpetrators. The dichotomous random variable $H$ is used to model parameter $h$ , and the posterior distribution of this random variable, given the data, is the ultimate aim of the forensic inquiry. Similarly, a random variable $\mathbf{P}$ is used to model the uncertainty over $\mathbf{p}$ . Because of Assumption 1, $\mathbf{p}$ is an infinite-dimensional parameter, hence the need for Bayesian nonparametric methods (Hjort et al., 2010; Ghosal and Van der Vaart, 2017). In particular, because of Assumption 2, data can be reduced to partitions, as explained in Section 3.5, and it will turn out that the distribution of these partitions does not depend on the order of the $p_{i}$ . Hence, we can define the parameter $\mathbf{p}$ as having values in $\nabla_{\infty}=\{(p_{1},p_{2},...)\mid p_{1}\geq p_{2}\geq...,\sum p_{i}=1,p_{i}>0\}$ , the ordered infinite-dimensional simplex. The uncertainty about its value will be expressed by the two-parameter Poisson Dirichlet prior (Pitman and Yor, 1997; Feng, 2010; Buntine and Hutter, 2012; Pitman, 2006).

The two-parameter Poisson-Dirichlet distribution can be defined through the following stick-breaking representation (Ghosal and Van der Vaart, 2017):

Definition 1 (two-parameter GEM distribution).

Given $\alpha$ and $\theta$ satisfying the following conditions:

[TABLE]

the vector $\mathbf{W}=(W_{1},W_{2},...)$ is said to be distributed according to the GEM( $\alpha,\theta$ ), if

[TABLE]

where $V_{1}$ , $V_{2}$ ,… are independent random variables distributed according to

[TABLE]

It holds that $W_{i}>0$ , and $\sum_{i}W_{i}=1.$

The GEM distribution (short for Griffin-Engen-McCloskey distribution’) is well-known in the literature as the “stick-breaking prior” since it measures the random sizes in which a stick is broken iteratively.

Definition 2 (Two-parameter Poisson Dirichlet distribution).

Given $\alpha$ and $\theta$ satisfying condition (2), and a vector $\mathbf{W}=(W_{1},W_{2},...)\sim\text{GEM}(\alpha,\theta)$ , the random vector $\mathbf{P}=(P_{1},P_{2},...)$ obtained by ranking $\mathbf{W}$ , such that $P_{i}\geq P_{i+1}$ , is said to be Poisson Dirichlet distributed PD $(\alpha,\theta)$ . Parameter $\alpha$ is called the discount parameter, while $\theta$ is the concentration parameter.

For our model we will not allow $\alpha=0$ , hence we will assume $0<\alpha<1$ , in order to have a prior that shows a power-law behavior as the one observed in the YHRD database (see Section 5.1).

The main reason that prompted us to choose the two-parameter Poisson Dirichlet distribution among the possible Bayesian nonparametric priors is given by model fitting (see Section 5.1). However, there is a very nice feature of this model. It is the only one that has the following very convenient sufficiency property (Zabell, 2005): the probability of observing a new species only depends on the number of already observed species and on the sample size, and the probability of observing an already seen species only depends on its frequency in the sample and on the sample size.

Lastly, we point out that, in practice, we cannot assume that we know the parameters $\alpha$ and $\theta$ : we will resolve this by using a hyperprior.

3.4 Bayesian network representation of the model

The typical data to evaluate in case of a match is $\mathcal{D}=(E,B)$ , where $E=(E_{s},E_{t})$ , and

$E_{s}$ = suspect’s DNA type,

$E_{t}$ = crime stain’s DNA type (matching the suspect’s type),

$B$ = a reference database of size $n$ , which is treated here as a random sample of DNA types from the population of possible perpetrators.

The hypotheses of interest for the case are:

$h_{p}$ = The crime stain originated from the suspect,

$h_{d}$ = The crime stain originated from someone else.

In agreement with Assumption 2, the model will ignore information about the names of the DNA types: data $\mathcal{D}=(E,B)$ will thus be reduced to $D$ accordingly. The Bayesian network of Figure 1 encapsulates the conditional dependencies of the random variables $(H,A,\Theta,\mathbf{P},X_{1},...,X_{n+2},D)$ , whose joint distribution is defined below in terms of the conditional distribution, using the factorization implied by the Bayesian network itself.

$H$ is a dichotomous random variable that represents the hypotheses of interest and can take values $h\in\{h_{p},h_{d}\}$ , according to the prosecution or the defense, respectively. A uniform prior on the hypotheses is chosen for mathematical convenience since it will not affect the likelihood ratio (the variable $H$ being in the conditioning part):

[TABLE]

( $A,\Theta$ ) is the random vector that represents the hyperparameters $\alpha$ and $\theta$ , satisfying condition (2). The joint prior density of these two parameters will be generically denoted as $p(\alpha,\theta)$ :

[TABLE]

For obvious reasons, this will be called the ‘hyperprior’ throughout the text.

The random vector $\mathbf{P}$ with values in $\nabla_{\infty}$ represents the ranked population frequencies of Y-STR profiles. $\mathbf{P}=(p_{1},p_{2},...)$ means that $p_{1}$ is the frequency of the most common DNA type in the population, $p_{2}$ is the frequency of the second most common DNA type, and so on. As a prior for $\mathbf{P}$ we use the two-parameter Poisson Dirichlet distribution:

[TABLE]

The database is assumed to be a random sample from the population. Integer-valued random variables $X_{1}$ , …, $X_{n}$ are here used to represent the (unknown) ranks in the population of the frequencies of the DNA types in the database. For instance, $X_{3}=5$ means that the third individual in the database has the fifth most common DNA type in the population. Given $\mathbf{p}$ they are an i.i.d. sample from $\mathbf{p}$ :

[TABLE]

$X_{n+1}$ represents the rank in the population ordering of the suspect’s DNA type. It is again an independent draw from $\mathbf{p}$ .

[TABLE]

$X_{n+2}$ represents the rank in the population ordering, of the crime stain’s DNA type. According to the prosecution, given $X_{n+1}=x_{n+1}$ , this random variable is deterministic (it is equal to $x_{n+1}$ with probability 1). According to the defence, it is another sample from $\mathbf{p}$ , independent of the previous ones:

[TABLE]

In order to observe $X_{1}$ , …, $X_{n+2}$ , one would need, by definition, to know the rank, in terms of population proportions, of the frequency of each DNA type in the database. This is not known, hence $X_{1},...,X_{n}$ are not observed.

Section 3.5 recalls some notions about random partitions, useful before defining node $D$ , the ‘reduced’ data that we want to evaluate.

3.5 Random partitions and database partitions

A partition of a set $S$ is an unordered collection of nonempty and disjoint subsets of $S$ , the union of which forms $S$ . Particularly interesting for our model are partitions of the set $S=[n]=\{1,...,n\}$ , denoted as $\pi_{[n]}$ . The set of all partitions of $[n]$ will be denoted as $\mathcal{P}_{[n]}$ . Random partitions of $[n]$ will be denoted as $\Pi_{[n]}$ , $n\in\mathbb{N}$ . Also, a partition of $n$ is a finite nonincreasing sequence of positive integers that sum up to $n$ . Partitions of $n$ will be denoted as $\pi_{n}$ , while random partitions as $\Pi_{n}$ .

Given a sequence of integer valued random variables $X_{1},...,X_{n}$ , let $\Pi_{[n]}(X_{1},X_{2},...,X_{n})$ be the random partition defined by the equivalence classes of their indices using the random equivalence relation $i\sim j$ if and only if $X_{i}=X_{j}$ . This construction allows one to build a “reduction” map from the set of values of $X_{1},...,X_{n}$ to the set of the partitions of $[n]$ as in the following example ( $n=10$ ):

[TABLE]

Similarly, and in agreement with Assumption 2, in our model we can consider the reduction of data which ignores information about the names of the DNA types: this is achieved, for instance, by retaining from the database only the equivalence classes of the indices of the individuals, according to the equivalence relation “has the same DNA type”. Stated otherwise, the database is reduced to the partition $\pi_{[n]}^{\text{Db}}$ , obtained using these equivalence classes. However, the database only supplies part of the data. There are also two new DNA profiles that are equal to one another (and different from the already observed ones in the rare type match case). Considering the suspect’s profile we obtain the partition $\pi^{\text{Db}+}_{[n+1]}$ , where the first $n$ integers are partitioned as in $\pi^{\text{Db}}_{[n]}$ , and $n+1$ constitutes a class by itself. Considering the crime stain profile we obtain the partition $\pi^{\text{Db}++}_{[n+2]}$ where the first $n$ integers are partitioned as in $\pi^{\text{Db}}_{[n]}$ , and $n+1$ and $n+2$ belong to the same (new) class. Random variables $\Pi^{\text{Db}}_{[n]}$ , $\Pi^{\text{Db}+}_{[n+1]}$ , and $\Pi^{\text{Db}++}_{[n+2]}$ are used to model $\pi_{[n]}^{\text{Db}}$ , $\pi_{[n+1]}^{\text{Db}+}$ , and $\pi_{[n+2]}^{\text{Db}++}$ , respectively.

Since prosecution and defense agree on the distribution of $X_{1},...,X_{n+1}$ , but not on the distribution of $X_{n+2}|X_{1},...,X_{n+1}$ , they also agree on the distribution of $\Pi^{\text{Db}+}_{[n+1]}$ but disagree on the distribution of $\Pi^{\text{Db}++}_{[n+2]}$ (see (4)).

The crucial points of the model are the following:

the random partitions defined through random variables $X_{1}$ , …, $X_{n+2}$ and through database are the same.

[TABLE] 2. 2.

although $X_{1}$ , …, $X_{n+2}$ were not observable, the random partitions $\Pi_{[n]}^{\text{Db}},\Pi_{[n+1]}^{\text{Db+}}$ , and $\Pi_{[n+2]}^{\text{Db++}}$ are observable.

To clarify, consider the following example of a database with $k=6$ different DNA types, from $n=10$ individuals:

[TABLE]

where $b_{i}$ is the name of the $i$ th DNA type according to the order chosen for the database. This database can be reduced to the partition of $[10]$ :

[TABLE]

Then, the part of reduced data whose distribution is agreed on by prosecution and defense is

[TABLE]

while the entire reduced data $D$ can be represented as

[TABLE]

Now, assume that we know the rank in the population of each of the DNA types in the database: we know that $b_{1}$ is, for instance, the second most frequent type, $b_{2}$ is the fourth most frequent type, and so on. Stated otherwise, we are now assuming that we observe the variables $X_{1}$ , …, $X_{n+2}$ : for instance, $X_{1}=2$ , $X_{2}=4$ , $X_{3}=2$ , $X_{4}=4$ , $X_{5}=3$ , $X_{6}=3$ , $X_{7}=10$ , $X_{8}=13$ , $X_{9}=5$ , $X_{10}=4$ , $X_{11}=9$ , $X_{12}=9$ (as in (5)). It is easy to check that $\Pi_{[10]}(X_{1},...,X_{10})=\pi_{[10]}^{\text{Db}}$ , $\Pi_{[11]}(X_{1},...,X_{11})=\pi_{[11]}^{\text{Db+}}$ , and $\Pi_{[12]}(X_{1},...,X_{12})=\pi_{[12]}^{\text{Db++}}.$

Coming back to our model, data is defined as $D=\pi^{\text{Db}++}_{[n+2]}$ , obtained partitioning the database enlarged with the two new observations (or partitioning $X_{1},...,X_{n+2}$ ). Node $D$ of Figure 1 is defined accordingly.

Notice that, given $X_{1},...,X_{n+2}$ , $D$ is deterministic. An important result is that, according to Proposition 4 in Pitman (1992), it is possible to derive directly the distribution of $D\mid\alpha,\theta,H$ . In particular, it holds that if

[TABLE]

and

[TABLE]

then, for all $n\in\mathbb{N}$ , the random partition $\Pi_{[n]}=\Pi_{[n]}(X_{1},...,X_{n})$ has the following distribution:

[TABLE]

where $n_{i}$ is the size of the $i$ th block of $\pi_{[n]}$ (the blocks are here ordered according to their least element), and $\forall x,b\in\mathbb{R},a\in\mathbb{N}$ , $[x]_{a,b}:=\begin{cases}\prod_{i=0}^{a-1}(x+ib)&\text{if }a\in\mathbb{N}\backslash\{0\}\\ 1&\text{if }a=0\end{cases}.$ This formula, also known as the Pitman sampling formula, is further studied in Pitman (1995) and shows that $\mathbb{P}_{n}^{\alpha,\theta}(\pi_{[n]})$ does not depend on $X_{1},...,X_{n}$ , but only on the sizes and the number of classes in the partitions. It follows that we can get rid of the intermediate layer of nodes $X_{1}$ , …, $X_{n+2}$ . Moreover, it holds have $\Pr(D|\alpha,\theta,h_{p})=\mathbb{P}_{n+1}^{\alpha,\theta}(\pi_{[n+1]}^{\text{Db+}})$ , while $\Pr(D|\alpha,\theta,h_{d})=\mathbb{P}_{n+2}^{\alpha,\theta}(\pi_{[n+2]}^{\text{Db++}})$ .

The model of Figure 1 can thus be simplified to the one in Figure 2.

3.6 Chinese Restaurant representation

There is an alternative characterization of this model, called “Chinese restaurant process”, due to Aldous (1985) for the one-parameter case, and studied in detail for the two-parameter version in Pitman (2006). It is defined as follows: consider a restaurant with infinitely many tables, each one infinitely large. Let $Y_{1},Y_{2},...$ be integer-valued random variables that represent the seating plan: tables are ranked in order of occupancy, and $Y_{i}=j$ means that the $i$ th customer seats at the $j$ th table to be created. The process is described by the following transition matrix:

[TABLE]

where $k$ is the number of tables occupied by the first $n$ customers, and $n_{i}$ is the number of customers that occupy table $i$ . The process depends on two parameters $\alpha$ and $\theta$ with the same conditions (2). From (9) one can easily see the sufficientness property mentioned in Section 3.3.

$Y_{1},...,Y_{n}$ are not i.i.d., nor exchangeable, but it holds that $\Pi_{[n]}(Y_{1},...,Y_{n})$ is distributed as $\Pi_{[n]}(X_{1},...,X_{n})$ , with $X_{1},...,X_{n}$ defined as in (3), and they are both distributed according to the Pitman sampling formula (8) (Pitman, 2006).

Stated otherwise, we can obtain $\pi_{[n]}^{\text{Db}}$ through seating plan of $n$ costumers or partitioning the $X_{1},...,X_{n}$ of the database and we obtain the same partition $\pi_{[n]}^{\text{Db}}$ . Similarly $\pi^{\text{Db}+}_{[n+1]}$ is obtained when a new customer has chosen an unoccupied table (remember we are in the rare type match case), and $\pi^{\text{Db}++}_{[n+2]}$ is obtained when the $(n+2)$ nd customer goes to the table already chosen by the $(n+1)$ st customer (suspect and crime stain have the same DNA type). In particular, thanks to (9), we can write:

[TABLE]

since the $(n+2)$ nd customer goes to the same table as the $(n+1)$ st (who was sitting alone).

3.7 A useful Lemma

The following lemma can be applied to four general random variables $Z$ , $X$ , $Y$ , and $H$ whose conditional dependencies are described by the Bayesian network of Figure 3. The importance of this result is due to the possibility of using it for assessing the likelihood ratio in a very common forensic situation: the prosecution and the defense disagree on the distribution of the entirety of data ( $Y$ ) but agree on the distribution of a part it ( $X$ ), and these distributions depend on parameters ( $Z$ ).

Lemma 3.1.

Given four random variables $Z$ , $H$ , $X$ and $Y$ , whose conditional dependencies are represented by the Bayesian network of Figure 3, the likelihood function for $h$ , given $X=x$ and $Y=y$ satisfies

[TABLE]

The Bayesian representation of the model, in Figure 3, allow to factor the joint probability density of $Z$ , $H$ , $X$ and $Y$ as

[TABLE]

By Bayes formula, $p(z)\,p(x\mid z)=p(x)\,p(z\mid x)$ . This rewriting corresponds to reversing the direction of the arrow between $Z$ and $X$ :

$Z$$H$$X$$Y$

The random variable $X$ is now a root node. This means that when we probabilistically condition on $X=x$ , the graphical model changes in a simple way: we can delete the node $X$ , but just insert the value $x$ as a parameter in the conditional probability tables of the variables $Z$ and $Y$ which formerly had an arrow from node $X$ . The next graph represents this model:

$Z$$H$$x$$x$$Y$

This tells us, that conditional on $X=x$ , the joint density of $Z$ , $Y$ and $H$ is equal to

[TABLE]

The joint density of $H$ and $Y$ given $X$ is obtained by integrating out the variable $Z$ . It can be expressed as a conditional expectation value since $p(z\mid x)$ is the density of $Z$ given $X=x$ . We find:

[TABLE]

Recall that this is the joint density of two of our variables, $H$ and $Y$ , after conditioning on the value $X=x$ . Let us now also condition on $Y=y$ . It follows that the density of $H$ given $X=x$ and $Y=y$ is proportional (as function of $H$ , for fixed $x$ and $y$ ) to the same expression, $p(h)\mathbb{E}(p(y\mid x,Z,h)\mid X=x)$ .

This is a product of the prior for $h$ with some function of $x$ and $y$ . Since posterior odds equals prior odds times likelihood ratio, it follows that the likelihood function for $h$ , given $X=x$ and $Y=y$ satisfies

[TABLE]

Corollary 3.1.

Given four random variables $Z$ , $H$ , $X$ and $Y$ , whose conditional dependencies are represented by the network of Figure 3, the likelihood ratio for $H=h_{1}$ against $H=h_{2}$ given $X=x$ and $Y=y$ satisfies

[TABLE]

4 The likelihood ratio

From now on we will omit the superscripts Db, Db+, and Db++ for ease of notation.

Using the hypotheses and the reduction of data $D$ defined in Section 3, the likelihood ratio will be defined as

[TABLE]

The last equality holds due to the fact that $\Pi_{[n+1]}$ is a deterministic function of $\Pi_{[n+2]}$ .

Corollary 3.1 can be applied to our model since defense and prosecution agree on the distribution of $\pi_{[n+1]}$ , but not on the distribution of $\pi_{[n+2]}$ , and data depends on parameters $\alpha$ and $\theta$ . Thus, if $(A,\Theta)$ play the role of $Z$ , $X=\Pi_{[n+1]}$ , and $Y=\Pi_{[n+2]}$ , by using (10) and (11), we obtain:

[TABLE]

The expected value is taken with respect to the posterior distribution of $A,\Theta\mid\Pi_{[n+1]}=\pi_{[n+1]}$ . The solution we propose in this paper is to deal with the uncertainty about $\alpha$ and $\theta$ by using MLE estimators and plug those estimators into the formula. Notice that this is equivalent to a hybrid approach, in which the parameters are estimated in a frequentist way and their values are plugged into the Bayesian LR. In the future, we plan to use MCMC methods to calculate as exactly as possible the exact posterior distribution, given assumed priors on the hyperparameters.

By defining the random variable ${\displaystyle\Phi=n\frac{1-A}{n+1+\Theta}}$ we can write the LR as

[TABLE]

5 Analysis on YHRD database

In this section, we present the study we made on a database of 18,925 Y-STR 23-loci profiles from 129 different locations in 51 countries in Europe (Purps et al., 2014)111The database has previously been cleaned by Mikkel Meyer Andersen (http://people.math.aau.dk/~mikl/?p=y23).. Our analyses are performed by considering only 7 Y-STR loci (DYS19, DYS389 I, DYS389 II, DYS3904, DYS3915, DY3926, DY3937) but similar results have been observed with the use of 10 loci.

5.1 Model fitting

First, we calculated the maximum likelihood estimators $\alpha_{\text{MLE}}$ and $\theta_{\text{MLE}}$ using the entire database and the likelihood defined by (8). Their values are $\alpha_{\text{MLE}}=0.51$ and $\theta_{\text{MLE}}=216.$

In Figure 4, the ranked frequencies of the 18’925 Y-STR profiles of the YHRD database are compared to the relative frequencies of samples of size $n$ obtained from several realizations of PD( $\alpha_{MLE},\theta_{MLE}$ ). To do so we run several times the Chinese Restaurant seating plan (up to $n=18,925$ customers): each run is used to approximate a new realization $\mathbf{p}$ from the PD( $\alpha_{MLE},\theta_{MLE}$ ). As explained in Section 3.6, the partition of the customers into tables is the same as the partition obtained from an i.i.d. sample of size $n$ from $\mathbf{p}$ . We can see that for the most common haplotypes (left part of the plot) there is some discrepancy. However, we are interested in rare haplotypes, which typically have a frequency belonging to the right part of the plot. In that region, the two-parameter Poisson Dirichlet follows the distribution of the data quite well.

The dotted line shows in Figure 4 the asymptotic behavior on the two-parameter Poisson Dirichlet distribution. Indeed, if $\mathbf{P}\sim\text{PD}(\alpha,\theta)$ , then

[TABLE]

for a random variable $Z$ such that $Z^{-\alpha}=\Gamma(1-\alpha)/S_{\alpha}$ . This power-law behavior describes an incredible variety of phenomena (Newman, 2005).

The thick line in Figure 4 also seems to have a power-law behavior, and to be honest, we were hoping to get the same asymptotic slope of the prior.This is not what we observe, but in Figure 5 it can be seen that for such a big value of $\theta$ we would need a bigger database (at least $n=10^{6}$ ) to see the correct slope.

5.2 Log-likelihood

It is also interesting to investigate the shape of the log-likelihood function for $\alpha$ and $\theta$ given $\pi_{[n+1]}$ . It is defined as

[TABLE]

In Figure 6 the log-likelihood reparametrized using $\phi={\displaystyle n\frac{1-\alpha}{n+1+\theta}}$ instead of $\alpha$ is displayed. A Gaussian distribution centered in the MLE parameters and with covariance matrix the inverse of the Fisher Information, is also displayed (in dashed lines). This is not done to show an asymptotic property, but to show the symmetry of the log-likelihood, which validates approximation of $\mathbb{E}(\Phi\mid\Pi_{[n+1]}=\pi_{[n+1]})$ with the marginal mode $\Phi_{MLE}$ , at least when we choose an hyperprior $p(\phi,\theta)$ that is flat around $(\phi_{MLE},\theta_{MLE})$ : indeed, it holds that $p(\phi,\theta\mid\pi_{[n+1]})\propto l_{n+1}(\phi,\theta)\times p(\phi,\theta)$ .

Hence, one could safely make this approximation if one believed that this symmetry would also be true in the real data situation at hand:

[TABLE]

Notice that this is equivalent to a hybrid approach, commonly called “Empirical Bayes”, in which the parameters are estimated through the MLE (frequentist) and their values are plugged into the Bayesian LR. We would like to reiterate that we are not using maximum likelihood estimates of the parameters because we consider the likelihood ratio from a frequentist point of view. Our aim is to calculate a Bayesian likelihood ratio, and we have observed empirically that using the maximum likelihood estimates of the parameters we can approximate this value.

Hence, in case of a rare type match problem, and using the YHRD database as the reference database, we have $\log_{10}\textrm{LR}=4.59$ , that corresponds to say that it is approximately 40,000 times more likely to observe the reduced data under the prosecution hypothesis than under the defence hypothesis.

5.3 True LR

It is also interesting to study the likelihood ratio values obtained with out method according to formula (4), and to compare it with the ‘true’ ones, meaning the LR values obtained when the vector $\mathbf{p}$ is known, over simulated rare type match cases. This corresponds to the desirable (even though completely imaginary) situation of knowing the ranked list of the frequencies of all the DNA types in the population of interest. The model can be represented by the Bayesian network of Figure 7.

The likelihood ratio in this case can be obtained using again Corollary 3.1, where now $X_{1}$ , …, $X_{n+1}$ play the role of $Z$ . Indeed, now that $\mathbf{p}$ is known, the unobservable part of the model are the ranks of the types in the database.

[TABLE]

Notice that, in the rare type case, $X_{n+1}$ is observed only once among the $X_{1}$ , …, $X_{n+1}$ . Hence, we call it a singleton, and its distribution given $\mathbf{p},\pi_{[n+1]}$ is the same as the distribution of each other singleton. Let $s_{1}$ denote the number of singletons, and $\mathcal{S}$ the set of indices of singletons observations in the augmented database. It holds

[TABLE]

Notice also that the knowledge of $\mathbf{p}$ and $\pi_{[n+1]}$ , is not enough to observe $X_{1},...,,X_{n+1}$ .

Let us denote as $X_{1}^{*}$ , .., $X^{*}_{K}$ the $K$ different values taken by $X_{1}$ , …, $X_{n+1}$ , ordered decreasingly according to the frequency of their values. Stated otherwise, if $n_{i}$ is the frequency of $x_{i}^{*}$ among $x_{1},...,x_{n+1},$ then $n_{1}\geq n_{2}\geq...\geq n_{K}$ . 222Moreover, in case $X_{i}^{*}$ and $X_{j}^{*}$ have the same frequency ( $n_{i}=n_{j}$ ), then they are ordered increasingly according to their values. For instance, if $X_{1}=2$ , $X_{2}=4$ , $X_{3}=2$ , $X_{4}=4$ , $X_{5}=3$ , $X_{6}=3$ , $X_{7}=10$ , $X_{8}=13$ , $X_{9}=5$ , $X_{10}=4$ , $X_{11}=9$ , then $X_{1}^{*}=4,X_{2}^{*}=2,X_{3}^{*}=3,X_{4}^{*}=5,X_{5}^{*}=9,X_{6}^{*}=10,X_{7}^{*}=13$ .

By definition, it holds that

[TABLE]

Notice that $(n_{1},n_{2},...,n_{K})$ is a partition of $n+1$ , which will be denoted as $\pi_{n+1}$ . In the example, $\pi_{n+1}=(3,2,2,1,1,1,1)$ . Since the distribution of ${\displaystyle\sum_{j:\,n_{j}=1}p_{{\color[rgb]{0,0,0}\definecolor[named]{pgfstrokecolor}{rgb}{0,0,0}\pgfsys@color@gray@stroke{0}\pgfsys@color@gray@fill{0}X_{j}^{*}}}}$ only depends on $\pi_{n+1}^{\text{Db+}}$ , the latter can replace $\pi_{[n+1]}^{\text{Db+}}$ . Thus, it holds that

[TABLE]

A more compact representation for $\pi_{n+1}^{\text{Db+}}$ can be obtained by using two vectors $\mathbf{a}$ and $\mathbf{r}$ where $a_{j}$ are the distinct numbers occurring in the partition, increasingly ordered, and each $r_{j}$ is the number of repetitions of $a_{j}$ . $J$ is the length of these two vectors, and it holds that $n+1=\sum_{j=1}^{J}a_{j}r_{j}.$ In the example above we have that $\pi_{n+1}$ can be represented by $(\mathbf{a},\mathbf{r})$ with $\mathbf{a}=(1,2,3)$ and $\mathbf{r}=(4,2,1),J=3$ .

There is an unknown map, $\chi$ , treated here as latent variable, which assigns the ranks of the DNA types, ordered according to their frequency in Nature, to one of the number $\{1,2,...,J\}$ corresponding to the position in $\mathbf{a}$ of its frequency in the sample, or to [math] if the type if not observed. Stated otherwise,

[TABLE]

Given $\pi_{n+1}=(\mathbf{a},\mathbf{r})$ , $\chi$ must satisfy the following set of $J$ conditions:

[TABLE]

In addition, it should not be allowed that a profiles observed $k_{N}$ times in the population is observed $k_{n}>k_{N}$ times in the sample. Hence we have to add a further condition:

[TABLE]

where $N$ is the size of the entire population.

The map $\chi$ can be represented by a vector $\boldsymbol{\chi}=(\chi_{1},\chi_{2},...)$ such that $\chi_{i}=\chi(i)$ . In the example we have that

[TABLE]

Notice that, given $\pi_{n+1}=(\mathbf{a},\mathbf{r})$ , the knowledge of $\boldsymbol{\chi}$ implies the knowledge of $X_{1}^{*}$ , …, $X_{K}^{*}$ : indeed it is enough to consider the position of the ranked positive values of $\boldsymbol{\chi}$ , and to solve ties by considering the positions themselves (if $\chi_{i}=\chi_{j}$ , than the order is given by $i$ and $j$ ). For instance, in the example if we sort the positive values of $\boldsymbol{\chi}$ and we collect their positions we get $(4,2,3,5,9,10,13)$ : the reader can notice that we got back to $X_{1}^{*},...,X_{7}^{*}$ .

This means that to obtain the distribution of $X_{1}^{*},...,X_{K}^{*}|\pi_{n+1},\mathbf{p}$ , which appears in (14), it is enough to obtain the distribution of $\boldsymbol{\chi}|\mathbf{a},\mathbf{r},\mathbf{p}$ , and since we are only interested in the mean of the sum of singletons in samples of size $n+1$ from the distribution of $X_{1}^{*},...,X_{K}^{*}|\mathbf{a},\mathbf{r},\mathbf{p}$ , we can just simulate samples from the distribution of $\boldsymbol{\chi}|\mathbf{a},\mathbf{r},\mathbf{p}$ and sum the $p_{i}$ such that $\chi_{i}=1.$

It holds that

[TABLE]

where the proportionality factor is $\frac{(n+1)!}{\prod_{1\leq j\leq J}(a_{j}!)^{r_{j}}}$ .

Details of the Metropolis Hashting algorithm

Notice that for the model we assumed $\mathbf{p}$ to be infinitely long, but for simulations we will use a finite $\mathbf{\bar{p}}$ , of length $m$ . This is equivalent to assume that only $m$ elements in the infinite $\mathbf{p}$ are positive, and the remaining infinite tail is made of zeros.

To simulate samples from the distribution of $\boldsymbol{\chi}|\mathbf{a},\mathbf{r},\mathbf{p}$ we use a Metropolis-Hastings algorithm on the space of the vectors $\boldsymbol{\chi}$ satisfying the $J+m$ conditions (15) and (16). Then the state space of the Metropolis-Hastings Markov chain is made of all vectors of length $m$ whose elements belong to $\{0,1,...,J\}$ , and satisfy the conditions (15) and (16). If we start with an initial point $\boldsymbol{\chi}_{0}$ which satisfies (15) and, at each move $t$ of the Metropolis-Hastings we swap two different values $\chi_{i}$ and $\chi_{j}$ inside the vector, condition (15) remains satisfied while conditions (16) must be checked at every iterations. The Metropolis factor is the ratio of the two likelihoods $p(\mathbf{a},\mathbf{r}\mid\boldsymbol{\chi}_{t},\mathbf{p})$ and $p(\mathbf{a},\mathbf{r}\mid\boldsymbol{\chi}_{t+1},\mathbf{p})$ where $\boldsymbol{\chi}_{t}$ and $\boldsymbol{\chi}_{t+1}$ differs only because $\chi_{i}$ and $\chi_{j}$ are exchanged. Hence, using (17), the Metropolis factor for every move is

[TABLE]

. Every exchange move is then accepted with probability $R$ . The algorithm is iterated $N=10^{5}$ times, with thinning steps of $10^{3}$ and a burnin period of 20000 iterations. Since it holds that

[TABLE]

for every accepted $\boldsymbol{\chi}$ we calculate the sum of all $p_{i}$ s such that $\chi_{i}=1$ and we use the average to approximate the denominator of (14).

The algorithm is based on a similar one proposed in Anevski et al. (2017).

This method allows us to approximate the ‘true’ LR when the vector $\mathbf{p}$ is known. This is almost never the case, but we can put ourselves in a fictitious world where we know $\mathbf{p}$ (such as the frequencies in the YHRD database, or as in the following section the frequencies from a smaller population) and compare the true values for the $\textrm{LR}_{\mid\mathbf{p}}$ with the one obtained by applying our Bayesian nonparametric model when $\mathbf{p}$ is unknown.

5.4 Frequentist-Bayesian analysis of the error

A real Bayesian statistician chooses the prior and hyperprior according to his beliefs. Depending on the choice of the hyperprior over $\alpha$ and $\theta$ she may or may not believe in the approximation (13), but she does not really talk of ‘error’. However, hardliner Bayesian statisticians are a rare species, and most of the time the Bayesian procedure consists in choosing priors (and hyperpriors) which are a compromise between personal beliefs and mathematical convenience. It is thus interesting to investigate the performance of such priors. This can be done by comparing the Bayesian likelihood ratio with the likelihood ratio that one would obtain if the vector $\mathbf{p}$ was known, and for the same reduction of data. This is what we call ‘error’: in other words, at the moment we are considering the Bayesian nonparametric method proposed in this paper as a way to estimate (notice the frequentist terminology) the true $\textrm{LR}_{\mid\mathbf{p}}$ . If we denote by $p_{x}$ the population proportion of the matching profile, another interesting comparison is the one between the Bayesian likelihood ratio and the frequentist likelihood ratio $1/p_{x}$ (here denoted as $\textrm{LR}_{f}$ ) that one would obtain knowing $\mathbf{p}$ , but not reducing the data to partition. This is a sort of benchmark comparison and tells us how much we lose by using the Bayesian nonparametric methodology, and by reducing data.

In total there are three quantities of interest ( $\log_{10}\textrm{LR}$ , $\log_{10}\textrm{LR}_{\mid\mathbf{p}}$ , and $\log_{10}\textrm{LR}_{f}$ ), and two differences of interest, which will be denoted as

•

$\textrm{Diff}_{1}=\log_{10}\textrm{LR}-\log_{10}\textrm{LR}_{\mid\mathbf{p}}$ (loss due to choice of the Poisson Dirichlet model and approximation (13)),

•

$\textrm{Diff}_{2}=\log_{10}\textrm{LR}-\log_{10}\textrm{LR}_{f}$ (overall loss).

In order to analyse these five quantities, we can study their distribution over different rare type match cases. However, there is an obstacle. The Metropolis-Hastings algorithm described in Section 5.3 is too slow to be used with the entire European database of Purps et al. (2014) of size $n=18^{\prime}925$ .

In order to make the computational effort feasible, we consider the haplotype frequencies for the sole Dutch population (of size $n=2037$ ), and we pretend that they are the frequencies from the entire population of possible perpetrators. This population is summarised by the following $\mathbf{a}$ , and $\mathbf{r}$ :

[TABLE]

and the maximum likelihood estimators for $\alpha$ and $\theta$ are $\alpha_{\text{MLE}}=0.62$ , $\theta_{\text{MLE}}=22$ .

In this way we can use the Metropolis-Hashting algorithm to simulate $\textrm{LR}_{\mid\mathbf{p}}$ . The model fitting is still good enough, as shown in Figure 8 (as a side note, notice that the asymptotic behavior is reached faster for this smaller value of $\theta_{\text{MLE}}=22$ ).

However, it is important to stress that the Gaussian shape and consequently the approximation (13) is not empirically supported for small databases of size $n=100$ .

In Table 1 and Figure 9 (a) we compare the distribution of $\log_{10}\textrm{LR}_{|\mathbf{p}}$ , $\log_{10}\textrm{LR}$ , and $\log_{10}\textrm{LR}_{f}$ obtained by 96 samples of size 100 from the Dutch population . Each sample represents a different rare type match case with a specific database of reference of size $n=100$ .

The distribution of the benchmark likelihood ratio $\log_{10}\textrm{LR}_{f}$ has more variation than the distribution of the Bayesian likelihood ratio, while $\log_{10}\textrm{LR}_{|\mathbf{p}}$ appears to be the most concentrated around its mean.

In Table 2 and Figure 9 (b) we consider the distribution of the two differences, Diff1 and Diff2. Diff1 is the smallest and the most concentrated: it ranges between -0.146 and 0.381 and has a small standard deviation. It means that the nonparametric Bayesian likelihood ratio obtained as in (13) can be thought of as a good approximation of the frequentist likelihood ratio for the same reduction of data ( $\log_{10}\textrm{LR}_{\mid\mathbf{p}}$ ), even though we have not empirically validated the approximation for small databases of size 100. This difference is due to three things: the approximation (13), the MLE estimation of the hyperparameters, and the choice of a prior distribution (two-parameter Poisson Dirichlet) which is quite realistic, as shown in Figure 8, but not perfectly fitting the actual population.

Notice that the difference increases if the Bayesian nonparametric likelihood is compared to the benchmark likelihood ratio (Diff2). Here, the reduction of data comes into play, too. However, the difference ranges within one order of magnitude, but most of the time lies between -0.676 and 0.115, thus small.

6 Conclusion

This paper discusses the first application of a Bayesian nonparametric method to likelihood ratio assessment in forensic science, in particular to the challenging situation of the rare type match. If compared to traditional Bayesian methods such as those described in Cereda (2017a), it presents many advantages. First of all, the prior chosen for the parameter $\mathbf{p}$ is more realistic for the population whose frequencies we want to model. Moreover, despite the theoretical background on which it lies may seem very technical and difficult, the method is extremely simple in practice, thanks to the use of an empirical Bayes approximation. More could be done in the future: in particular regarding approximation (13). The posterior expectation in the denominator could, for instance, be treated using MCMC algorithms or ABC algorithms. Then, we can try to improve the efficiency of the Metropolis Hashting algorithm defined in Section 5.3 in order to be used with bigger and better populations. The big problem is how to use these methods when relevant populations are poorly defined and accessible data-bases are of doubtful relevance. We don’t solve those problems.

It is not clear whether other methods are better. This is all very open and controversial. We suggest the analyst to perform several very different analyses and think carefully what the differences between the conclusions tells her. With this aim, we plan to compare this Bayesian nonparametric method to other existing methods for the rare type match problem, investigating calibration and validation through the use of ECE plots (Ramos et al., 2013).

Acknowledgment

We are indebted to Jim Pitman and Alexander Gnedin for their help in understanding their important theoretical results, to Mikkel Meyer Andersen for providing a cleaned version of the database of Purps et al. (2014) and to Stephanie Van der Pas, Pierpaolo De Blasi and Giacomo Aletti for their useful opinions and comments. This research was supported by the Swiss National Science Foundation, through grants no. P2LAP2 $\_$ 178195

Bibliography49

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Robertson and Vignaux (1995) Robertson, B.; Vignaux, G.A. Interpreting Evidence: Evaluating Forensic Science in the Courtroom ; John Wiley & Sons: Chichester, 1995.
2Evett and Weir (1998) Evett, I.; Weir, B. Interpreting DNA evidence: Statistical Genetics for Forensic Scientists ; Sinauer Associates: Sunderland, 1998.
3Aitken and Taroni (2004) Aitken, C.; Taroni, F. Statistics and the Evaluation of Evidence for Forensics Scientists ; John Wiley & Sons: Chichester, 2004.
4Balding (2005) Balding, D. Weight-of-evidence for Forensic DNA Profiles ; John Wiley & Sons: Chichester, 2005.
5Taroni et al. (2006) Taroni, F.; Aitken, C.; Garbolino, P.; Biedermann, A. Bayesian Networks and Probabilistic Inference in Forensic Science ; John Wiley & Sons: Chichester, 2006.
6Cereda (2017 a) Cereda, G. Bayesian approach to LR in case of rare type match. Statistica Neerlandica 2017 , 71 , 141–164.
7Cereda (2017 b) Cereda, G. Impact of model choice on LR assessment in case of rare haplotype match (frequentist approach). Scandinavian Journal of Statistics 2017 , 44 , 230–248.
8Brenner (2010) Brenner, C.H. Fundamental problem of forensic mathematics – The evidential value of a rare haplotype. Forensic Science International: Genetics 2010 , 4 , 281–291.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

A nonparametric Bayesian approach to the rare type match problem

Abstract

1 Introduction

2 State of the art

2.1 The rare type match problem

2.2 Evaluation of matching probabilities of Y-STR data

3 The model

3.1 Notation and data

3.2 Model assumptions

3.3 Prior

Definition 1** (two-parameter GEM distribution).**

Definition 2** (Two-parameter Poisson Dirichlet distribution).**

3.4 Bayesian network representation of the model

3.5 Random partitions and database partitions

3.6 Chinese Restaurant representation

3.7 A useful Lemma

Lemma 3.1**.**

Corollary 3.1**.**

4 The likelihood ratio

5 Analysis on YHRD database

5.1 Model fitting

5.2 Log-likelihood

5.3 True LR

Details of the Metropolis Hashting algorithm

5.4 Frequentist-Bayesian analysis of the error

6 Conclusion

Acknowledgment

Definition 1 (two-parameter GEM distribution).

Definition 2 (Two-parameter Poisson Dirichlet distribution).

Lemma 3.1.

Corollary 3.1.