Bayesian Propagation of Record Linkage Uncertainty into Population Size Estimation of Human Rights Violations

Mauricio Sadinle

arXiv:1812.09590·stat.ME·December 27, 2018

TL;DR

This paper introduces a Bayesian linkage-averaging method to incorporate record linkage uncertainty into population size estimates, improving accuracy in human rights violation studies with imperfect data.

Contribution

It proposes a two-stage Bayesian approach that propagates linkage uncertainty into population estimates, allowing flexible model integration and better handling of data errors.

Findings

01. Effective propagation of linkage uncertainty demonstrated in case study

02. Method accommodates various linkage and capture-recapture models

03. Improves population size estimates in noisy, incomplete data contexts

Tables

Table 1: Construction of levels of disagreement for lists from El Salvador.

Field	Similarity Measure	Level 0	Level 1	Level 2	Level 3
Given Name	Normalized Levenshtein*	0	$(0, 0.25]$	$(0.25, 0.5]$	$(0.5, 1]$
Family Name	Normalized Levenshtein*	0	$(0, 0.25]$	$(0.25, 0.5]$	$(0.5, 1]$
Year of Death	Absolute Difference	0	1	2–3	4+
Month of Death	Absolute Difference	0	1	2–3	4+
Day of Death	Absolute Difference	0	1–2	3–7	8+
Place of Death	Binary Comparison	Agree	Disagree	–	–

* Modification of Sadinle (2014) to account for Hispanic naming conventions.

Table 2: Prior truncation points $\lambda_{fl}$ for the $m_{fl}$ parameters in the joint duplicate detection and record linkage for three datafiles from El Salvador.

$l$	Given Name	Family Name	Year of Death	Month of Death	Day of Death	Municipality
0	0.95	0.95	0.90	0.80	0.70	0.80
1	0.99	0.99	0.95	0.90	0.70	–
2	0.99	0.99	0.99	0.99	0.70	–

Table 3: Marginal posterior distributions of the frequencies of inclusion patterns.

Table 4: Linkage-averaging for two-sample estimates of $N$. $\hat{N}$: expected value computed from $p_{\textsc{la}}(N)$. CI: credible interval. The plots in the second column have the same horizontal and vertical scales.

Table 5: Summaries of linkage-averaging for three-sample population size estimates using individual graphical models. $\hat{N}$: expected value computed from $p_{\textsc{la}}(N)$. CI: credible interval. The plots in the third column have the same horizontal and vertical scales. The data sources are 1: ER-TL, 2: CDHES, 3: UNTC.

Equations

$$\gamma_{ij}^{f}=l,\quad\text{if }\mathcal{S}_{f}(i,j)\in I_{fl}.$$

$$\boldsymbol{\Gamma}_{ij}\mid Z_{i}=Z_{j}\overset{iid}{\sim}G_{1};\qquad\boldsymbol{\Gamma}_{ij}\mid Z_{i}\neq Z_{j}\overset{iid}{\sim}G_{0}.$$

$$P_{1}(\boldsymbol{\gamma}^{obs}_{ij}\mid\Phi_{1})=\prod_{f=1}^{F}\Bigg[\prod_{l=0}^{L_{f}-1}(m_{fl})^{I(\gamma^{f}_{ij}=l)}(1-m_{fl})^{I(\gamma^{f}_{ij}>l)}\Bigg]^{I_{obs}(\gamma_{ij}^{f})},$$

$$P(\boldsymbol{n}^{*}\mid N,\boldsymbol{\theta}(m),m)=N!\prod_{\boldsymbol{h}\in\{0,1\}^{K}}\frac{\theta_{\boldsymbol{h}}(m)^{n_{\boldsymbol{h}}}}{n_{\boldsymbol{h}}!}.$$

$$P(N\mid\boldsymbol{n},m)=\frac{P(\boldsymbol{n}\mid N,m)\,p(N)}{\sum_{N}P(\boldsymbol{n}\mid N,m)\,p(N)},$$

$$P(\boldsymbol{n}\mid N,m)=\frac{N!}{\prod_{\boldsymbol{h}\in\{0,1\}^{K}}n_{\boldsymbol{h}}!}\,\frac{\Psi_{m}(\boldsymbol{\alpha}+\boldsymbol{n}^{*})}{\Psi_{m}(\boldsymbol{\alpha})},$$

$$\Psi_{m}(\boldsymbol{\alpha})=\frac{\prod_{l=1}^{L}\prod_{\boldsymbol{h}_{C_{l}}}\Gamma(\alpha_{\boldsymbol{h}_{C_{l}}})}{\Gamma\big(\sum_{\boldsymbol{h}\in\{0,1\}^{K}}\alpha_{\boldsymbol{h}}\big)^{Q}\prod_{l=2}^{L}\prod_{\boldsymbol{h}_{S_{l}}}\Gamma(\alpha_{\boldsymbol{h}_{S_{l}}})}.$$

$$P(N\mid\boldsymbol{n})=\frac{p(N)\sum_{m}P(\boldsymbol{n}\mid N,m)\,p(m)}{\sum_{N}p(N)\sum_{m}P(\boldsymbol{n}\mid N,m)\,p(m)},$$

$$h_{zk}=\begin{cases}1,&\text{if there exists a record }i\in\mathbf{X}_{k}\text{ such that }Z_{i}=z;\\0,&\text{otherwise.}\end{cases}$$

$$p_{\textsc{la}}(N)\approx\frac{1}{d}\sum_{t=1}^{d}p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)})),$$

$$p_{\textsc{la}}(N)\approx\frac{1}{db}\sum_{t=1}^{d}\sum_{v=1}^{b}I(N=N^{(v,t)}).$$

$$p_{\textsc{la}}(N)\equiv E_{\mathbf{Z}\mid\mathbf{X}}[p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))]=\sum_{\mathbf{Z}}p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))\,p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X}),$$

$$\mathrm{Var}(N\mid\mathbf{X})=\dots$$

$$p_{\textsc{la}}(N)=\sum_{\mathbf{Z}}\sum_{m}p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}),m)\,p_{\textsc{c}}(m\mid\boldsymbol{n}(\mathbf{Z}))\,p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X}),$$

$$p_{\textsc{c}}(m\mid\boldsymbol{n}(\mathbf{Z}))=\frac{p(m)\sum_{N}L_{m}(N\mid\boldsymbol{n}(\mathbf{Z}))\,p(N)}{\sum_{m}\sum_{N}L_{m}(N\mid\boldsymbol{n}(\mathbf{Z}))\,p(N)\,p(m)},$$

$$\mathrm{Var}(N\mid\mathbf{X})=\dots+E_{\mathbf{Z}\mid\mathbf{X}}\{E_{m\mid\mathbf{Z}}[\mathrm{Var}(N\mid\mathbf{Z},m)]\},$$


Full text

Bayesian Propagation of Record Linkage Uncertainty into Population Size Estimation of Human Rights Violations

Mauricio Sadinle, University of Washington

Department of Biostatistics

Department of Statistics

Center for Statistics and the Social Sciences

University of Washington

Box 357232

Seattle, WA 98195

Abstract

Multiple-systems or capture-recapture estimation are common techniques for population size estimation, particularly in the quantitative study of human rights violations. These methods rely on multiple samples from the population, along with the information of which individuals appear in which samples. The goal of record linkage techniques is to identify unique individuals across samples based on the information collected on them. Linkage decisions are subject to uncertainty when such information contains errors and missingness, and when different individuals have very similar characteristics. Uncertainty in the linkage should be propagated into the stage of population size estimation. We propose an approach called linkage-averaging to propagate linkage uncertainty, as quantified by some Bayesian record linkage methodologies, into a subsequent stage of population size estimation. Linkage-averaging is a two-stage approach in which the results from the record linkage stage are fed into the population size estimation stage. We show that under some conditions the results of this approach correspond to those of a proper Bayesian joint model for both record linkage and population size estimation. The two-stage nature of linkage-averaging allows us to combine different record linkage models with different capture-recapture models, which facilitates model exploration. We present a case study from the Salvadoran civil war, where we are interested in estimating the total number of civilian killings using lists of witnesses’ reports collected by different organizations. These lists contain duplicates, typographical and spelling errors, missingness, and other inaccuracies that lead to uncertainty in the linkage. We show how linkage-averaging can be used for transferring the uncertainty in the linkage of these lists into different models for population size estimation.

Keywords: Capture-recapture, Counting casualties, Data linkage, Decomposable graphical model, Duplicate detection, Entity resolution, Multiple-systems estimation, Multiple record linkage.

Partially supported by NSF grants SES-11-30706 and SES-11-31897.

1 Introduction

In the context of armed conflicts, a basic question is how many human rights violations occurred in a given time and space. While a complete enumeration is not typically feasible, it is common to find multiple organizations monitoring and collecting reports on those violations. Given that witnesses or victims may report an event to different organizations, and different witnesses may report an event to the same organization, the associated record systems often end up containing multiple entries referring to the same violations, even within the same data source. Those reports may contain different degrees of detail and accuracy, and typically do not contain unique identifiers of the victims, such as national identification numbers. Therefore, even the more basic question of how many unique human rights violations have been reported cannot be easily answered. Record linkage techniques are required to detect duplicated reports within each source and to link coreferent reports across data sources. The result of this linkage stage is often used to derive estimates of the total number of unreported violations using capture-recapture or multiple-systems estimation. The Human Rights Data Analysis Group (HRDAG; https://hrdag.org/) has been a leader and a pioneer in using these methodologies to study human rights violations in several countries (see, e.g., Lum, Price and Banks, 2013; Price, Gohdes and Ball, 2015; Price and Ball, 2015). Here we revisit a case from El Salvador, where we combine three data sources to explore the question of how many civilians were killed during the Salvadoran civil war (1980–1991) in San Salvador.

A limitation in this area of application is that population sizes are estimated taking a given linkage of the lists as being the correct one. Current practice therefore understates the overall uncertainty around the population size as it ignores the uncertainty from the linkage. We propose a simple procedure called linkage-averaging for incorporating the uncertainty from record linkage into subsequent population size estimation using multiple-systems or capture-recapture models. Linkage-averaging is possible thanks to the advent of Bayesian partitioning approaches that provide proper accounts of the uncertainty in the linkage process (e.g., Matsakis, 2010; Sadinle, 2014; Steorts, Hall and Fienberg, 2016). Linkage-averaging requires two stages. First, we use a Bayesian partitioning approach to obtain a posterior sample of possible linkages between the lists. Then, for each of those linkages we obtain a posterior distribution on the population size using a capture-recapture model. The individual population size posteriors are combined by taking a simple average. This approach is appealing for being simple and intuitive, and we show that if the capture-recapture model uses only functions of the linkage then linkage-averaging is equivalent to a proper Bayesian approach to joint record linkage and population size estimation. The two-stage nature of linkage-averaging facilitates model exploration as linkage results can be reused with different capture-recapture models, and it is also well suited for lists with restricted access due to confidentiality constraints given that the information used for the linkage does not have to be transferred to the analyst doing population size estimation. Linkage-averaging has broader applicability since, for example, census coverage evaluation (e.g., Ericksen, Kadane and Tukey, 1989; Hogan, 1992, 1993; Anderson and Fienberg, 1999) and disease prevalence estimation (e.g., LaPorte et al., 1993; Madigan and York, 1997) are also carried out by linking multiple data sources followed by population size estimation.
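The two stages described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_posterior` is a hypothetical placeholder standing in for a capture-recapture posterior on the population size (it is not a real capture-recapture model), and the two linkage draws are made up rather than sampled from a record linkage posterior.

```python
def linkage_average(linkage_draws, posterior_given_linkage, support):
    """Average the per-linkage posteriors on N over the linkage draws.

    linkage_draws: list of labelings Z^(1), ..., Z^(d)
    posterior_given_linkage: callable returning a dict {N: probability}
    support: iterable of candidate population sizes N
    """
    d = len(linkage_draws)
    averaged = {N: 0.0 for N in support}
    for z in linkage_draws:
        posterior = posterior_given_linkage(z)
        for N in averaged:
            averaged[N] += posterior.get(N, 0.0) / d
    return averaged


def toy_posterior(z):
    """Hypothetical stand-in for a capture-recapture posterior (assumption):
    puts half the mass on the number of linked individuals and half on one more."""
    n = len(set(z))
    return {n: 0.5, n + 1: 0.5}


# Two made-up linkage draws for five records: 3 and 4 distinct individuals.
draws = [(1, 2, 1, 3, 3), (1, 2, 1, 3, 4)]
p_la = linkage_average(draws, toy_posterior, range(0, 10))
print({N: p for N, p in p_la.items() if p > 0})
```

Because the averaging step only needs the draws and a function returning a posterior on the population size, the same draws can be reused with a different capture-recapture model by swapping the callable.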

We review Bayesian partitioning approaches to record linkage in Section 2, Bayesian approaches for population size estimation in Section 3, and in Section 4 we show how to combine them using linkage-averaging. Finally, in Section 5 we apply this approach to the case study from El Salvador mentioned above.

2 Bayesian Partitioning Record Linkage Approaches

Let $\mathbf{X}_{k}$ be the $k$th data source or list, which contains $r_{k}$ records as its rows, $k=1,\dots,K$. We define $\mathbf{X}=(\mathbf{X}_{1},\dots,\mathbf{X}_{K})^{T}$ as the concatenated list containing all the $r=\sum_{k}r_{k}$ records coming from the $K$ different sources. The total number of different fields available from the lists is $F$, and if one of these fields is not recorded in a list then it will be missing in $\mathbf{X}$ for all records coming from that list. With $n\leq r$ different individuals represented in $\mathbf{X}$, jointly detecting duplicates within lists and linking records across lists is equivalent to partitioning the rows of $\mathbf{X}$ into the $n$ groups of coreferent records. This coreference partition (Matsakis, 2010; Sadinle, 2014; Steorts, Hall and Fienberg, 2016) is the parameter of interest in joint duplicate detection and record linkage.

A computationally simple representation of partitions uses arbitrary labelings of the partition’s groups. Let $\mathbf{Z}=(Z_{1},\dots,Z_{r})$ be a vector of length $r$ representing a labeling of the records in $\mathbf{X}$ , such that two records receive the same label if and only if they are a match/coreferent. An intuitive way of thinking of $\mathbf{Z}$ is as an underlying unique identifier that we want to recover. Although the labeling given by $\mathbf{Z}$ is arbitrary, any equivalent relabeling leads to the same partition of the records, which is what we care about. Indeed, two records are a match or coreferent if and only if $Z_{i}=Z_{j}$ . To fix ideas, the vectors $\mathbf{Z}=(1,2,1,3,3)$ and $\mathbf{Z}=(4,5,4,2,2)$ are two labelings of the same partition of five elements, since in both $Z_{1}=Z_{3}\neq Z_{4}=Z_{5}$ , and $Z_{2}$ gets its own value.
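The equivalence of labelings can be checked mechanically. The following sketch relabels groups by order of first appearance so that any two labelings of the same partition map to the same tuple; the function names are introduced here for illustration only.

```python
def canonical_labeling(z):
    """Relabel groups 1, 2, ... in order of first appearance."""
    first_seen = {}
    relabeled = []
    for label in z:
        if label not in first_seen:
            first_seen[label] = len(first_seen) + 1
        relabeled.append(first_seen[label])
    return tuple(relabeled)


def same_partition(z_a, z_b):
    """True if the two labelings induce the same partition of the records."""
    return canonical_labeling(z_a) == canonical_labeling(z_b)


def num_individuals(z):
    """Number of distinct labels, i.e. the number of groups in the partition."""
    return len(set(z))


print(canonical_labeling((4, 5, 4, 2, 2)))               # (1, 2, 1, 3, 3)
print(same_partition((1, 2, 1, 3, 3), (4, 5, 4, 2, 2)))  # True
```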

From a Bayesian point of view, one obtains a posterior distribution on $\mathbf{Z}$ given $\mathbf{X}$ , and the variability captured by this posterior should ideally reflect the uncertainty in the record linkage and duplicate detection procedure. There exist two types of approaches to obtain such a posterior on $\mathbf{Z}$ : direct modeling approaches and comparison-based approaches.

2.1 Direct-Modeling Approaches

A number of Bayesian approaches to both duplicate detection and record linkage have been proposed where one directly models the information contained in the lists/datafiles (Matsakis, 2010; Tancredi and Liseo, 2011; Fortini et al., 2002; Gutman, Afendulis and Zaslavsky, 2013; Steorts, Hall and Fienberg, 2016). That is, one proposes a model $P(\mathbf{X}\mid\mathbf{Z})$ for the information observed in the lists, and a posterior on $\mathbf{Z}$ is derived as $p(\mathbf{Z}\mid\mathbf{X})\propto p(\mathbf{Z})P(\mathbf{X}\mid\mathbf{Z})$, with the help of a prior on partitions $p(\mathbf{Z})$. To write down $P(\mathbf{X}\mid\mathbf{Z})$ one needs to craft specific models for each type of field in the lists. The models of Matsakis (2010), Tancredi and Liseo (2011), Steorts, Hall and Fienberg (2016), and Steorts (2015) share the characteristic that, given a value of $\mathbf{Z}$, the clusters of coreferent records are modeled as distortions of some latent record containing the true information of a latent individual. These approaches currently handle mostly categorical fields, with the exceptions of Steorts (2015), who proposed an empirical Bayes approach to model names, and Liseo and Tancredi (2011), who handle continuous fields under normality. In practice, however, fields that are complicated to model, such as strings, addresses, phone numbers, or dates, are also important for detecting coreferent records. These types of fields are often subject to typographical and other types of errors, which makes it important to take into account partial agreements between their values. Existing direct-modeling approaches also currently do not handle missing data, although this extension should be easy to implement.

2.2 Comparison-Based Approaches

A number of approaches to record linkage and duplicate detection rely on the often reasonable assumption that two records referring to the same entity should be very similar. If this is not the case in a given application scenario then the task of finding coreferent records might be hopeless. Comparison vectors are computed for pairs of records from $\mathbf{X}$ to summarize evidence of whether they refer to the same entity. For record pair $(i,j)$ we compare each field $f=1,\dots,F$ by computing some similarity measure $\mathcal{S}_{f}(i,j)$ , which depends on the type of information contained by each field. For unordered categorical fields like sex or race, $\mathcal{S}_{f}$ can simply be a binary comparison checking whether the records agree in that field. For more structured fields, $\mathcal{S}_{f}$ should be able to capture partial agreements. For example, in the case of strings such as names or addresses, $\mathcal{S}_{f}$ should correspond to a string metric, such as the Levenshtein edit distance, the Jaro–Winkler score, or any other (see Bilenko et al., 2003; Elmagarmid, Ipeirotis and Verykios, 2007). Some of these comparisons will be missing, since if field $f$ is missing for a record $i$ , then $\mathcal{S}_{f}(i,j)$ will be missing regardless of whether field $f$ is observed for record $j$ .

In principle, we could define the comparison vectors using the original similarity values $\mathcal{S}_{f}(i,j)$ , $f=1,\dots,F$ , but the direct modeling of the $\mathcal{S}_{f}(i,j)$ ’s requires customized models per type of comparison, because the outputs of these similarity measures lie in different spaces, depending on the type of field being compared. Instead, Sadinle (2014) followed Winkler (1990) in dividing the range of each similarity measure $\mathcal{S}_{f}$ into $L_{f}+1$ intervals $I_{f0},I_{f1},\dots,I_{fL_{f}}$ , that represent different levels of disagreement. In this construction we associate the interval $I_{f0}$ with the highest level of agreement, including no disagreement, and the last interval, $I_{fL_{f}}$ , with the highest level of disagreement, which depending on the field may represent complete or strong disagreement. For records $i$ and $j$ , and field $f$ , we define

$$\gamma_{ij}^{f}=l,\quad\text{if }\mathcal{S}_{f}(i,j)\in I_{fl}.$$

As the value of $\gamma^{f}_{ij}$ increases, so does the disagreement between records $i$ and $j$ with respect to field $f$ . The possible values of $\gamma^{f}_{ij}$ simply represent the categories of an ordinal variable. We then define the comparison vector $\boldsymbol{\gamma}_{ij}=(\gamma_{ij}^{1},\dots,\gamma_{ij}^{f},\dots,\gamma_{ij}^{F})$ for records $i$ and $j$ . Building comparison data as ordinal categorical variables facilitates modeling since we can use a generic model for any type of comparison, as long as its values are categorized in a meaningful way.
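As a concrete illustration of this construction, the sketch below maps similarity values to disagreement levels using the cut points listed in Table 1 for names and for year of death. It assumes a plain normalized Levenshtein distance (edit distance divided by the longer string's length) rather than the modification for Hispanic naming conventions used in the case study, and all function and constant names are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_levenshtein(a, b):
    """Edit distance scaled to [0, 1] by the length of the longer string."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


def disagreement_level(value, upper_bounds):
    """Return the level l whose interval contains value, or None if missing.

    upper_bounds holds the inclusive right endpoints of the first L_f
    intervals; anything above the last bound falls in the top level L_f.
    """
    if value is None:
        return None
    for level, bound in enumerate(upper_bounds):
        if value <= bound:
            return level
    return len(upper_bounds)


NAME_BOUNDS = (0.0, 0.25, 0.5)  # levels: 0, (0, 0.25], (0.25, 0.5], (0.5, 1]
YEAR_BOUNDS = (0, 1, 3)         # levels: 0, 1, 2-3, 4+


def compare_names(a, b):
    """Disagreement level for a name field; missing if either value is missing."""
    if a is None or b is None:
        return None
    return disagreement_level(normalized_levenshtein(a, b), NAME_BOUNDS)


def compare_years(a, b):
    """Disagreement level for year of death based on the absolute difference."""
    if a is None or b is None:
        return None
    return disagreement_level(abs(a - b), YEAR_BOUNDS)


gamma_ij = (compare_names("maria", "marta"), compare_years(1981, 1983))
print(gamma_ij)  # (1, 2)
```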

A number of traditional record linkage and duplicate detection approaches use pairwise comparisons $\boldsymbol{\gamma}_{ij}$ , but they output independent pairwise decisions on the matching/coreference status of pairs of records (Fellegi and Sunter, 1969; Winkler, 1988; Jaro, 1989; Larsen and Rubin, 2001), which then need to be reconciled in some ad-hoc manner as they may not be compatible with one another. Sadinle (2014) modified comparison-based approaches to directly target $\mathbf{Z}$ rather than pairwise matching decisions. Letting $\boldsymbol{\Gamma}(\mathbf{X})$ denote the comparison data for all record pairs, the approach of Sadinle (2014) corresponds to a model $P(\boldsymbol{\Gamma}(\mathbf{X})\mid\mathbf{Z})$ which along with a prior $p(\mathbf{Z})$ allows us to obtain a posterior $p(\mathbf{Z}\mid\boldsymbol{\Gamma}(\mathbf{X}))$ .

The model for the comparison data $\boldsymbol{\Gamma}(\mathbf{X})$ presented by Sadinle (2014) assumes that $\boldsymbol{\gamma}_{ij}$ is a realization of a random vector $\boldsymbol{\Gamma}_{ij}$ such that:

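$$\boldsymbol{\Gamma}_{ij}\mid Z_{i}=Z_{j}\overset{iid}{\sim}G_{1};\qquad\boldsymbol{\Gamma}_{ij}\mid Z_{i}\neq Z_{j}\overset{iid}{\sim}G_{0}.$$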

In this model, $G_{1}$ and $G_{0}$ represent the distributions of the comparison vectors among coreferent and non-coreferent pairs, respectively.

Sadinle (2014) parameterized $G_{1}$ as

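$$P_{1}(\boldsymbol{\gamma}^{obs}_{ij}\mid\Phi_{1})=\prod_{f=1}^{F}\Bigg[\prod_{l=0}^{L_{f}-1}(m_{fl})^{I(\gamma^{f}_{ij}=l)}(1-m_{fl})^{I(\gamma^{f}_{ij}>l)}\Bigg]^{I_{obs}(\gamma_{ij}^{f})},$$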

which is obtained under conditional independence of the comparison fields, and ignorability of the missingness in the comparison vectors. $I_{obs}(\gamma_{ij}^{f})$ indicates whether $\gamma_{ij}^{f}$ is observed, $\Phi_{1}=(\boldsymbol{m}_{1},\dots,\boldsymbol{m}_{F})$ , with $\boldsymbol{m}_{f}=(m_{f0},\dots,m_{f,L_{f}-1})$ , where $m_{f0}=P_{1}(\Gamma^{f}_{ij}=0)$ , and $m_{fl}=P_{1}(\Gamma^{f}_{ij}=l\mid\Gamma^{f}_{ij}>l-1)$ for $0<l<L_{f}$ . A similar expression can be obtained for $P_{0}(\boldsymbol{\gamma}^{obs}_{ij}\mid\Phi_{0})$ in terms of parameters $\Phi_{0}=(\boldsymbol{u}_{1},\dots,\boldsymbol{u}_{F})$ .
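A short sketch may help in reading this parameterization. It evaluates the probability of an observed comparison vector among coreferent pairs as a product over fields of sequential conditional probabilities, skipping fields whose comparison is missing. The function name and the numerical values are illustrative assumptions, not estimates from the case study.

```python
def coreferent_likelihood(gamma, m):
    """Probability of an observed comparison vector under G_1.

    gamma: one disagreement level per field, None when the comparison is missing
    m: one sequence per field with (m_f0, ..., m_{f,L_f - 1})
    """
    prob = 1.0
    for level, m_f in zip(gamma, m):
        if level is None:
            continue  # missing comparison: the field contributes a factor of 1
        for l, m_fl in enumerate(m_f):
            if level == l:
                prob *= m_fl        # stop at level l
            elif level > l:
                prob *= 1.0 - m_fl  # disagreement exceeds level l
    return prob


# One field with four levels (L_f = 3) and made-up parameter values.
m_field = (0.9, 0.8, 0.5)
level_probs = [coreferent_likelihood((l,), (m_field,)) for l in range(4)]
print(level_probs)
```

In this toy example the four level probabilities sum to one, which is a quick sanity check that the sequential conditional probabilities define a proper distribution over disagreement levels.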

The parameterization in terms of the sequential conditional probabilities $m_{fl}$ facilitates prior specification. The parameter $m_{fl}=P_{1}(\Gamma^{f}_{ij}=l\mid\Gamma^{f}_{ij}>l-1)$ represents the probability of observing disagreement level $l$ in the comparison $f$, among two coreferent records with disagreement larger than level $l-1$. Unless we expect field $f$ in one of these two datafiles to be highly unreliable, we would a priori expect each $m_{fl}$ to be fairly close to 1. For example, for $l=0$ this is simply $m_{f0}=P_{1}(\Gamma^{f}_{ij}=0)$, which represents the marginal probability of disagreement level zero, which encodes full or a high degree of agreement, and so $m_{f0}$ should be high if the field $f$ in these two datafiles does not contain too many errors. For $l=1$, we have $m_{f1}=P_{1}(\Gamma^{f}_{ij}=1\mid\Gamma^{f}_{ij}>0)$, which represents the probability of observing disagreement level one in the comparison $f$, among coreferent record pairs with disagreement larger than what is captured by the level zero. If the number of disagreement levels is greater than two, we can think of level one of disagreement as a type of mild disagreement, meaning that if we expect the amount of error to be relatively small, then $m_{f1}$ should be concentrated around values close to one. As we consider other parameters $m_{fl}$ for levels $l\geq 2$, it is easy to see that they should also be close to one, if field $f$ does not contain too many errors. In general, we can therefore think of using the priors $m_{fl}\sim\text{Uniform}[\lambda_{fl},1]$, for some prior truncation points $0<\lambda_{fl}<1$, such that the less accurate we believe field $f$ is, the lower the value for $\lambda_{fl}$. More generally, we could take truncated beta priors, but here we focus on specifying our prior beliefs through these truncation points $\lambda_{fl}$.
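To see what these truncated uniform priors imply, the sketch below draws the $m_{fl}$ from $\text{Uniform}[\lambda_{fl},1]$ and converts a set of sequential conditional probabilities into marginal probabilities of each disagreement level. The truncation points are the ones listed for given name in Table 2; the function names and the seed are illustrative assumptions.

```python
import random


def sample_m_prior(lambdas, rng):
    """Draw each m_fl independently from Uniform[lambda_fl, 1]."""
    return [rng.uniform(lam, 1.0) for lam in lambdas]


def marginal_level_probs(m_f):
    """Convert (m_f0, ..., m_{f,L_f - 1}) into P(level = l) for l = 0, ..., L_f."""
    probs = []
    remaining = 1.0  # probability of not having stopped at an earlier level
    for m_fl in m_f:
        probs.append(remaining * m_fl)
        remaining *= 1.0 - m_fl
    probs.append(remaining)  # top level L_f gets whatever mass is left
    return probs


GIVEN_NAME_LAMBDAS = (0.95, 0.99, 0.99)  # truncation points from Table 2

rng = random.Random(0)  # arbitrary seed, for reproducibility only
draw = sample_m_prior(GIVEN_NAME_LAMBDAS, rng)
print(marginal_level_probs(draw))

# Marginal level probabilities at the truncation points themselves.
at_bounds = marginal_level_probs(GIVEN_NAME_LAMBDAS)
print(at_bounds)
```

At the truncation points for given name, at least 95% of the mass sits on level zero, so the prior rules out parameter values under which coreferent pairs would frequently disagree strongly on that field.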

It is more difficult to incorporate prior information on the probabilities $u_{fl}=P_{0}(\Gamma^{f}_{ij}=l\mid\Gamma^{f}_{ij}>l-1)$ , since the distribution of the disagreement levels among non-coreferent record pairs may be quite different depending on the characteristics of the fields. For example, a categorical field with a highly frequent category will lead to a high probability of $\Gamma^{f}_{ij}=0$ even for non-coreferent record pairs, but a field like phone number or address will lead to small probabilities of agreement among non-coreferent record pairs. For simplicity we therefore take each $u_{fl}\sim\text{Uniform}(0,1)$ .

The approach of Sadinle (2014) heavily relies on being able to reduce the set of candidate coreferent record pairs on which vectors of comparisons are computed. By using simple rules that can efficiently identify non-coreferent pairs we seek to avoid comparing all the $r(r-1)/2$ record pairs when $r$ is large. For example, if the datafiles contain a categorical field deemed to be error-free, one can simply take records disagreeing on that field as being non-coreferent. This simple approach is known as blocking. Unfortunately, in many applications all fields may be subject to error, and therefore we need to devise other ways of filtering non-coreferent records. An alternative is to exploit prior knowledge on the kinds of errors that would be unlikely for a certain field, thereby declaring as non-coreferent any record pair that disagrees more than a predefined amount in that field. There also exist other more sophisticated techniques to detect sets of non-coreferent pairs, which are extensively surveyed by Christen (2012).
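The two filtering ideas in this paragraph can be sketched as follows: blocking on a field treated as error-free, followed by a rule that discards pairs whose values differ by more than a chosen amount in another field. The field names `region` and `year`, the threshold, and the toy records are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def block_pairs(records, key):
    """Candidate pairs restricted to records that agree on the blocking field."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[record[key]].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return sorted(pairs)


def filter_by_gap(pairs, records, field, max_gap):
    """Drop pairs that differ by more than max_gap in a numeric field.

    Pairs with a missing value in that field are kept, since nothing
    can be ruled out for them.
    """
    kept = []
    for i, j in pairs:
        a, b = records[i].get(field), records[j].get(field)
        if a is None or b is None or abs(a - b) <= max_gap:
            kept.append((i, j))
    return kept


records = [
    {"region": "A", "year": 1981},
    {"region": "A", "year": 1982},
    {"region": "B", "year": 1981},
    {"region": "A", "year": 1989},
    {"region": "B", "year": None},
]
blocked = block_pairs(records, "region")
candidates = filter_by_gap(blocked, records, "year", max_gap=3)
print(len(blocked), candidates)
```

With five records there are ten possible pairs; blocking leaves four, and the gap rule leaves two. Pairs removed this way are treated as non-coreferent, so rules of this kind should only encode disagreements that are judged very unlikely for coreferent records.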

After this initial filtering step, the set $\mathcal{P}$ comprises the remaining candidate coreferent record pairs, on which we compute comparison vectors. Using these comparison vectors we define additional rules to fix record pairs as non-coreferent. For instance, strong disagreements in both given and family names, or in some other combination of fields, may be a robust indication of the pair being non-coreferent. The final set of candidate coreferent pairs is $\mathcal{C}\subseteq\mathcal{P}$.

The possible coreference partitions are finally constrained to the set $\mathcal{Z}=\{\mathbf{Z}:Z_{i}\neq Z_{j},\ \forall\ (i,j)\notin\mathcal{C}\}$, that is, any partition that puts together record pairs already declared as non-coreferent is infeasible. The approach of Sadinle (2014) relies on $\mathcal{Z}$ being much smaller than the set of all possible partitions, which is why we heavily rely on being able to obtain a small set of candidate pairs $\mathcal{C}$. The comparison vectors of the pairs in $\mathcal{P}\setminus\mathcal{C}$ are used in the model but fixed as non-coreferent pairs, that is, they never get assigned the same label in $\mathbf{Z}$. The prior distribution on $\mathbf{Z}$ used by Sadinle (2014) was derived from a uniform distribution on partitions constrained to partitions consistent with $\mathcal{Z}$. A simple way to obtain the flat prior on partitions from a prior for $\mathbf{Z}$ is by assigning equal probability to each of the $r!/(r-n)!$ labelings of a partition with $n$ groups, which leads to the prior on partition labelings $p(\mathbf{Z})\propto[(r-n(\mathbf{Z}))!/r!]I(\mathbf{Z}\in\mathcal{Z})$, where $n(\mathbf{Z})$ measures the number of different labels in labeling $\mathbf{Z}$.
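The unnormalized log prior on labelings can be written down directly from this expression. The sketch below returns $\log[(r-n(\mathbf{Z}))!/r!]$ when the labeling respects the candidate-pair constraint and minus infinity otherwise; the function name, the 0-based record indices, and the example pairs are assumptions made for illustration.

```python
import math


def log_prior_labeling(z, candidate_pairs):
    """Unnormalized log prior: log[(r - n(Z))! / r!] if Z is feasible, else -inf.

    z: labeling of the r records
    candidate_pairs: pairs (i, j) of 0-based record indices allowed to be linked
    """
    r = len(z)
    allowed = {frozenset(pair) for pair in candidate_pairs}
    for i in range(r):
        for j in range(i + 1, r):
            if z[i] == z[j] and frozenset((i, j)) not in allowed:
                return float("-inf")  # links a pair fixed as non-coreferent
    n = len(set(z))
    return math.lgamma(r - n + 1) - math.lgamma(r + 1)


z = (1, 2, 1, 3, 3)  # r = 5 records, n = 3 groups
print(math.exp(log_prior_labeling(z, [(0, 2), (3, 4)])))  # approximately 1/60
print(log_prior_labeling(z, [(0, 2)]))                    # -inf
```

Each partition with $n$ groups has $r!/(r-n)!$ labelings, each receiving weight $(r-n)!/r!$, so every feasible partition receives the same total weight, which is the flat prior on partitions described above.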

Finally, Sadinle (2014) developed a Gibbs sampler to obtain draws from the posterior distribution of $\mathbf{Z}$ .

2.3 A Practical Comparison of Bayesian Partitioning Record Linkage Approaches

Both direct-modeling and comparison-based approaches have advantages and disadvantages when compared to one another. One can argue that direct-modeling approaches are more principled, as they directly model the records in the datafiles/lists. Instead, comparison-based approaches merely model comparisons between pairs of records. This advantage of direct-modeling approaches can also be seen as a disadvantage, as the lists $\mathbf{X}$ may contain some combination of fields that are difficult to directly model like family and given names, dates, addresses, phone numbers, etc. Writing $P(\mathbf{X}\mid\mathbf{Z})$ requires proposing models for such fields, which requires modeling how such information gets corrupted. Comparison-based approaches have an advantage here, because any type of field can be used to construct the comparison data, as long as the comparisons are meaningful for the fields at hand. Therefore, models $P(\boldsymbol{\Gamma}(\mathbf{X})\mid\mathbf{Z})$ will often be much simpler than models $P(\mathbf{X}\mid\mathbf{Z})$ .

In this article we will use the comparison-based approach of Sadinle (2014), which is better suited to the data from El Salvador. Direct-modeling approaches currently do not handle missing data and need computational speed-ups. For example, the approach of Steorts (2015) as implemented in the R package blink takes 8.4 hours to compute 30,000 MCMC iterations with a file of size 500 included in the blink package, and requires around 10,000 iterations to reach convergence. By contrast, the approach of Sadinle (2014) can take advantage of fixing obvious non-coreferent record pairs as non-coreferent, which leads to a much faster Gibbs sampler. With the file of size 500 included in the blink package, after fixing record pairs with high Levenshtein distance in first or last name as non-coreferent, we obtain 15,052 candidate coreferent pairs. Running 30,000 iterations of the Gibbs sampler of Sadinle (2014) takes one hour, and convergence is achieved in fewer than 10 iterations. This comparison was done on a laptop with an Intel Core i7-4900MQ processor.

Regardless of what approach one uses, the critical requirement needed in this article is that the record linkage approach provides a set of draws $\mathbf{Z}^{(1)},\mathbf{Z}^{(2)},\dots,\mathbf{Z}^{(d)}$ from a posterior $p(\mathbf{Z}\mid\boldsymbol{\Gamma}(\mathbf{X}))$ , $\boldsymbol{\Gamma}(\mathbf{X})$ being the comparison data in the case of comparison-based approaches, or from $p(\mathbf{Z}\mid\mathbf{X})$ in the case of direct-modeling approaches.

3 Population Size Estimation

To estimate the total number of units or individuals in a closed population, a number of techniques rely on the availability of $K$ incomplete lists/samples drawn from the population. The name capture-recapture comes from applications in population ecology where the goal is to estimate animal abundance. In that context the technique consists of drawing $K$ samples from the population in a sequential manner while keeping track of the individuals’ inclusion patterns, that is, which individuals have been included in which samples (see, e.g., Pollock, 2000). In the context of estimating the size of human populations, the $K$ samples often come from record systems which are not necessarily collected in a sequential manner, but are represented by datafiles or lists containing (partially) identifying information on the individuals. In that context the term multiple-systems estimation is often preferred (see, e.g., Bird and King, 2018). The discussion in this article applies to capture-recapture or multiple-systems estimation models with sufficient statistics that depend only on the inclusion patterns of the different individuals (e.g., Fienberg, 1972; Bishop, Fienberg and Holland, 1975; Castledine, 1981; George and Robert, 1992; Madigan and York, 1997; Fienberg, Johnson and Junker, 1999; Manrique-Vallier, 2016).

Let an inclusion pattern be represented by a vector $\boldsymbol{h}=(h_{1},\dots,h_{K})$ in $\{0,1\}^{K}$ , where $h_{k}=1$ indicates inclusion in the record-system $k$ . Let $n_{\boldsymbol{h}}$ represent the number of individuals with inclusion pattern $\boldsymbol{h}$ . The inclusion patterns’ frequencies can be organized in a contingency table $\boldsymbol{n}^{*}=\{n_{\boldsymbol{h}}\}_{\boldsymbol{h}\in\{0,1\}^{K}}$ . Notice that we do not observe the number of individuals missed by all record-systems, that is, $n_{00\dots 0}$ is missing, and so we let $\boldsymbol{n}=\{n_{\boldsymbol{h}}\}_{\boldsymbol{h}\in\{0,1\}^{K}\setminus\{0\}^{K}}$ represent the observed counts. For example, with three record-systems we denote the observed frequencies of the different inclusion patterns as $\boldsymbol{n}=\{n_{111},n_{011},n_{101},n_{001},n_{110},n_{010},n_{100}\}$ , where, for example, $n_{101}$ represents the number of individuals included in record-systems one and three but not in record-system two.

For a given individual we can think of their inclusion pattern $\boldsymbol{h}$ as a realization of a $K$ -variate binary vector such that $P(\boldsymbol{h}\mid\boldsymbol{\theta})=\theta_{\boldsymbol{h}}$ , with the vector $\boldsymbol{\theta}=\{\theta_{\boldsymbol{h}}\}_{\boldsymbol{h}\in\{0,1\}^{K}}$ providing the probability of each inclusion pattern. Let $\boldsymbol{\theta}(m)$ denote the capture probabilities as dictated by a model $m$ . Given that there are $N=\sum_{\boldsymbol{h}\in\{0,1\}^{K}}n_{\boldsymbol{h}}$ individuals in the population, under the assumption that their inclusion patterns are independent and identically distributed, we have that the joint distribution of the contingency table $\boldsymbol{n}^{*}$ is multinomial with probability mass function

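$$P(\boldsymbol{n}^{*}\mid N,\boldsymbol{\theta}(m),m)=\frac{N!}{\prod_{\boldsymbol{h}\in\{0,1\}^{K}}n_{\boldsymbol{h}}!}\prod_{\boldsymbol{h}\in\{0,1\}^{K}}\theta_{\boldsymbol{h}}(m)^{n_{\boldsymbol{h}}}.\tag{3.1}$$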

Notice that since for given $N$ and $\boldsymbol{n}$ we can obtain $n_{00\dots 0}=N-\sum_{\boldsymbol{h}\in\{0,1\}^{K}\setminus\{0\}^{K}}n_{\boldsymbol{h}}$ , we can write $P(\boldsymbol{n}\mid N,\boldsymbol{\theta}(m),m)=P(\boldsymbol{n}^{*}\mid N,\boldsymbol{\theta}(m),m)$ .

Given a model $m$ and a prior on the population size $p(N)$ , we are interested in obtaining a posterior distribution

$$p(N\mid\boldsymbol{n},m)\propto p(N)\,P(\boldsymbol{n}\mid N,m),\tag{3.2}$$

where

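$$P(\boldsymbol{n}\mid N,m)=\int P(\boldsymbol{n}\mid N,\boldsymbol{\theta}(m),m)\,p(\boldsymbol{\theta}(m)\mid m)\,d\boldsymbol{\theta}(m),\tag{3.3}$$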

for a prior on the model parameters $p(\boldsymbol{\theta}(m)\mid m)$ , assuming that $N$ and $\boldsymbol{\theta}(m)$ are independent a priori.
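As a worked illustration of how (3.2) and (3.3) combine, the sketch below evaluates the posterior of $N$ on a grid for $K=2$ lists under the independence model. The priors are assumptions made only for this example and are not the ones used in this article: each list's inclusion probability gets a Beta(1,1) prior, so the integral in (3.3) reduces to a product of Beta functions, and $p(N)\propto 1/N$ is truncated at a hypothetical upper bound `N_max`. The function names are likewise hypothetical.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_likelihood(N, n11, n10, n01):
    """log P(n | N, m) for two independent lists with Beta(1,1) priors (assumed)."""
    n_obs = n11 + n10 + n01
    if N < n_obs:
        return float("-inf")
    n00 = N - n_obs                      # individuals missed by both lists
    n1, n2 = n11 + n10, n11 + n01        # sizes of list 1 and list 2
    log_multinomial = lgamma(N + 1) - sum(
        lgamma(c + 1) for c in (n00, n11, n10, n01)
    )
    # Integrating theta_k^{n_k} (1 - theta_k)^{N - n_k} over a uniform prior
    # gives B(n_k + 1, N - n_k + 1) for each list k.
    return (log_multinomial
            + log_beta(n1 + 1, N - n1 + 1)
            + log_beta(n2 + 1, N - n2 + 1))


def posterior_N(n11, n10, n01, N_max):
    """Grid posterior p(N | n, m) with the assumed prior p(N) proportional to 1/N."""
    n_obs = max(n11 + n10 + n01, 1)
    support = list(range(n_obs, N_max + 1))
    log_post = [log_marginal_likelihood(N, n11, n10, n01) - log(N)
                for N in support]
    shift = max(log_post)                # stabilize the exponentials
    weights = [exp(v - shift) for v in log_post]
    total = sum(weights)
    return support, [w / total for w in weights]
```

The grid must extend far enough that the truncated tail carries negligible posterior mass; otherwise the normalization in `posterior_N` distorts the result.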

As mentioned before, a number of approaches for population size estimation fit into this description, but for simplicity we only describe the approach based on decomposable graphical models of Madigan and York (1997) and the approach based on mixture models of Manrique-Vallier (2016).

3.1 Approaches Based on Graphical Models

It is especially convenient to work with models and priors that allow a closed form for $P(\boldsymbol{n}\mid N,m)$ in (3.3). Madigan and York (1997) present one class of graphical models that have this characteristic. Probabilistic graphical models (see, e.g., Lauritzen, 1996; Edwards, 2000) provide a way of encoding the set of conditional independencies of a multivariate distribution into a graph. In a graphical model, each random variable is represented by a node in a graph, and two nodes are joined by an edge if the variables are conditionally dependent given a set of other variables. In the context of this article a graphical model captures conditional independencies between the binary variables that indicate inclusion of the individuals into the lists $\mathbf{X}_{1},\dots,\mathbf{X}_{K}$ . A graphical model $m$ will depend on a set of parameters $\boldsymbol{\theta}(m)$ that satisfy certain constraints dictated by the independencies in the graph. Madigan and York (1997) further restrict their attention to the class of decomposable graphical models, which are characterized by their independence graph being chordal (triangulated). The first two columns of Table 5 present all non-saturated graphical models for three samples/lists, in which case all happen to be decomposable. Dawid and Lauritzen (1993) introduced the hyper-Dirichlet distributions, which can be used as priors for the parameters $\boldsymbol{\theta}(m)$ in such models, and lead to closed formulae for $P(\boldsymbol{n}\mid N,m)$ . For the sake of this article, it is enough to say that the parameters of a hyper-Dirichlet prior can be specified by thinking of a table $\boldsymbol{\alpha}=\{\alpha_{\boldsymbol{h}}\}_{\boldsymbol{h}\in\{0,1\}^{K}}$ of “prior counts” of the same size as $\boldsymbol{n}^{*}$ . In this document we take all the entries of $\boldsymbol{\alpha}$ to be a constant $\alpha$ , in particular $\alpha=1$ .
Given a hyper-Dirichlet prior for the model parameters $\boldsymbol{\theta}(m)$ , and if $N$ and $\boldsymbol{\theta}(m)$ are independent a priori, Madigan and York (1997) show that

[TABLE]

where

[TABLE]

In this expression $\{C_{l}\}_{l=1}^{L}$ represents the set of (maximal) cliques, $\{S_{l}\}_{l=2}^{L}$ the set of separators (including multiplicities), and $Q$ the number of connected components of the independence graph of model $m$ . For a given subset of nodes $A$ , $\boldsymbol{h}_{A}$ represents an inclusion pattern constrained to the variables in $A$ . Finally, $\alpha_{\boldsymbol{h}_{A}}=\sum_{\boldsymbol{h}^{\prime}:\boldsymbol{h}^{\prime}_{A}=\boldsymbol{h}_{A}}\alpha_{\boldsymbol{h}^{\prime}}$ . (Notice that Equation (3.5) appears with $Q=1$ in Madigan and York (1997), but if we do not take the number of connected components into account then $P(\boldsymbol{n}\mid N,m)$ does not add up to 1).

With the methodology of Madigan and York (1997) we can also take into account the uncertainty on the model for population size estimation as

$$p(N\mid\boldsymbol{n})\propto p(N)\sum_{m}P(\boldsymbol{n}\mid N,m)\,p(m),\tag{3.6}$$

for a prior $p(m)$ on a finite number of models. In this article we take $p(m)$ to be uniform over the class of models. For three lists, there are seven non-saturated decomposable graphical models, and so $p(m)=1/7$ .
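Once $P(\boldsymbol{n}\mid N,m)$ can be evaluated for every model on a grid of values of $N$, the model-averaged posterior of $N$ and the posterior model probabilities follow by normalizing the joint weights $P(\boldsymbol{n}\mid N,m)\,p(N)\,p(m)$. The sketch below shows this step only; the function name and the input layout (a dictionary of log-likelihood rows, one per model) are assumptions made for illustration, and the closed form (3.4) and (3.5) is not reproduced here.

```python
from math import exp, log


def model_average(log_lik, prior_N, prior_m):
    """Average over models on a grid of N values.

    log_lik[m][i] = log P(n | N_i, m), prior_N[i] = p(N_i), prior_m[m] = p(m).
    Returns (p(N_i | n) for each grid point, p(m | n) for each model).
    """
    log_joint = {
        m: [ll + log(p_N) + log(prior_m[m]) for ll, p_N in zip(row, prior_N)]
        for m, row in log_lik.items()
    }
    shift = max(max(row) for row in log_joint.values())
    weights = {m: [exp(v - shift) for v in row] for m, row in log_joint.items()}
    total = sum(sum(row) for row in weights.values())
    post_N = [sum(weights[m][i] for m in weights) / total
              for i in range(len(prior_N))]
    post_m = {m: sum(row) / total for m, row in weights.items()}
    return post_N, post_m
```

With three lists and a uniform prior over models, `prior_m` would assign $1/7$ to each of the seven non-saturated decomposable models.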

3.2 Approaches Based on Mixture Models

An alternative model $m$ for the probabilities of the inclusion patterns $P(\boldsymbol{h}\mid\boldsymbol{\theta}(m))=\theta_{\boldsymbol{h}}(m)$ is obtained by assuming the existence of strata $s=1,\dots,S$ , such that inside each of them the inclusion indicators are independent of each other, that is, $P(\boldsymbol{h}\mid s,\boldsymbol{\theta}_{s})=\prod_{k=1}^{K}\theta_{sk}^{h_{k}}(1-\theta_{sk})^{1-h_{k}}$ , where $P(h_{k}=1\mid s,\boldsymbol{\theta}_{s})=\theta_{sk}$ is the probability of an individual being included in list $k$ given that it belongs to stratum $s$ . Each stratum has a probability $\pi_{s}$ , $\sum_{s=1}^{S}\pi_{s}=1$ . The probability of the inclusion patterns under this mixture model approach is therefore $\theta_{\boldsymbol{h}}(m)=\sum_{s=1}^{S}\pi_{s}\prod_{k=1}^{K}\theta_{sk}^{h_{k}}(1-\theta_{sk})^{1-h_{k}}$ , which can then be plugged into (3.1).
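The mixture formula for $\theta_{\boldsymbol{h}}(m)$ can be checked numerically by enumerating all $2^{K}$ inclusion patterns. The following is a minimal sketch; the function name and the list-of-lists layout for the $\theta_{sk}$ are assumptions made for illustration.

```python
from itertools import product


def pattern_probabilities(pi, theta):
    """Inclusion-pattern probabilities under the mixture of independence models.

    pi[s] is the probability of stratum s; theta[s][k] is the probability of
    inclusion in list k for an individual in stratum s.
    """
    K = len(theta[0])
    probs = {}
    for h in product((0, 1), repeat=K):
        total = 0.0
        for pi_s, theta_s in zip(pi, theta):
            term = pi_s
            for h_k, t in zip(h, theta_s):
                term *= t if h_k == 1 else 1.0 - t
            total += term
        probs[h] = total
    return probs
```

The returned dictionary includes the unobserved pattern $(0,\dots,0)$, so its values sum to one whenever the $\pi_{s}$ do.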

Manrique-Vallier (2016) used the priors $\theta_{sk}\sim\text{Beta}(1,1)$ , and expressed each $\pi_{s}=V_{s}\prod_{t<s}(1-V_{t})$ where each $V_{t}\sim\text{Beta}(1,\alpha)$ , $t=1,\dots,S-1$ , $V_{S}=1$ , and $\alpha\sim\text{Gamma}(.25,.25)$ . This construction is known as a finite-dimensional stick-breaking prior (Ishwaran and James, 2001) and it encourages most of the mass to be concentrated in the initial $\pi_{s}$ ’s, which consequently makes the choice of $S$ irrelevant as long as it is relatively large. These priors would in principle allow us to integrate out $\boldsymbol{\theta}(m)$ as in (3.3), and then obtain (3.2), but in this case the integrals are not easily computable, which is why Manrique-Vallier (2016) developed an MCMC algorithm to obtain posterior samples from (3.2) under this mixture model approach. For further details on this approach see Manrique-Vallier (2016).
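The stick-breaking construction of the stratum probabilities can be sketched as follows. This draws one set of weights from the prior for a fixed value of $\alpha$; treating $\alpha$ as a given argument, rather than drawing it from its Gamma prior, is a simplification made for this example, and the function name is hypothetical.

```python
import random


def stick_breaking_weights(S, alpha, rng):
    """Draw (pi_1, ..., pi_S) from the finite-dimensional stick-breaking prior.

    V_t ~ Beta(1, alpha) for t = 1, ..., S-1 and V_S = 1, with
    pi_s = V_s * prod_{t < s} (1 - V_t).
    """
    sticks = [rng.betavariate(1.0, alpha) for _ in range(S - 1)] + [1.0]
    weights = []
    remaining = 1.0                      # length of the stick not yet assigned
    for v in sticks:
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


weights = stick_breaking_weights(S=10, alpha=0.5, rng=random.Random(1))
```

Setting $V_{S}=1$ assigns whatever remains of the stick to the last stratum, which is what makes the weights sum to one for any finite $S$.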

4 Linkage-Averaged Population Size Estimation

4.1 Derivation of Inclusion Patterns

We start by explaining how to compute the incomplete contingency table $\boldsymbol{n}$ from a given coreference partition labeling $\mathbf{Z}$ . Let $n$ be the number of different labels in $\mathbf{Z}$ , that is, $n$ represents the number of different individuals that are included in the $K$ datafiles/lists according to the coreference partition represented by $\mathbf{Z}$ . Without loss of generality we can think of the labels in $\mathbf{Z}$ to be $1,\dots,n$ . If this is not the case we can simply obtain an equivalent labeling that uses those labels. Now, for each different label $z=1,\dots,n$ , let

[TABLE]

The vector $\mathbf{H}_{k}=(h_{1k},\dots,h_{nk})$ contains the indicators of whether each of the $n$ individuals is included in the $k$ th datafile. The contingency table $\boldsymbol{n}$ is simply obtained as a cross-classification of these $K$ inclusion vectors. We write $\boldsymbol{n}(\mathbf{Z})$ to emphasize that the contingency table $\boldsymbol{n}$ is a function of a coreference partition represented by $\mathbf{Z}$ .
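The mapping from a coreference labeling to the table $\boldsymbol{n}(\mathbf{Z})$ can be sketched in a few lines. The representation below is an assumption made for illustration: `Z` holds one label per record, and `file_index` holds the (zero-based) datafile of each record, in the same order.

```python
from collections import Counter


def inclusion_table(Z, file_index, K):
    """Cross-classify individuals by inclusion pattern, given a labeling Z.

    Z[i] is the label of record i and file_index[i] in {0, ..., K-1} is the
    datafile that record i belongs to. Returns a Counter mapping each observed
    pattern (h_1, ..., h_K) to the number of individuals with that pattern.
    """
    patterns = {}
    for z, k in zip(Z, file_index):
        # Duplicates within a datafile simply set the same indicator again.
        patterns.setdefault(z, [0] * K)[k] = 1
    return Counter(tuple(h) for h in patterns.values())


# Six records in three datafiles, linked into three individuals.
table = inclusion_table(Z=[1, 2, 1, 3, 2, 2], file_index=[0, 0, 1, 1, 2, 2], K=3)
```

The pattern $(0,\dots,0)$ can never appear in the result, since every label in $\mathbf{Z}$ is attached to at least one record.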

4.2 Linkage-Averaged Population Size Estimation

The output that we use from the record linkage and duplicate detection stage is a posterior sample $\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(d)}$ from a posterior $p(\mathbf{Z}\mid\mathbf{X})$ or $p(\mathbf{Z}\mid\boldsymbol{\Gamma}(\mathbf{X}))$ , as exemplified in Figure 1.

For each of these draws, we can compute the implied contingency tables containing the frequencies of the inclusion patterns $\boldsymbol{n}(\mathbf{Z}^{(1)}),\dots,\boldsymbol{n}(\mathbf{Z}^{(d)})$ . For each of these contingency tables, we can obtain a posterior distribution on the population size using one of the capture-recapture models in Section 3, that is, we can obtain $p(N\mid\boldsymbol{n}(\mathbf{Z}^{(1)})),\dots,p(N\mid\boldsymbol{n}(\mathbf{Z}^{(d)}))$ , or a Monte Carlo approximation of these. The linkage-averaged posterior of $N$ , $p_{\textsc{la}}(N)$ , defined formally in the next Section, is approximated as

$$p_{\textsc{la}}(N)\approx\frac{1}{d}\sum_{t=1}^{d}p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)})),\tag{4.1}$$

when each $p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)}))$ is available in closed form, as with the methodology of Madigan and York (1997). When this is not the case, as with the approach of Manrique-Vallier (2016), we use a random sample $N^{(1,t)},\dots N^{(b,t)}\sim p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)}))$ , for each $t=1,\dots,d$ , and use the approximation

$$p_{\textsc{la}}(N)\approx\frac{1}{db}\sum_{t=1}^{d}\sum_{j=1}^{b}I(N^{(j,t)}=N).\tag{4.2}$$
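Both approximations amount to simple averaging, as the following sketch illustrates. The function names and data layouts are assumptions: each closed-form posterior is a dictionary mapping values of $N$ to probabilities, and each Monte Carlo sample is a list of $b$ draws of $N$.

```python
from collections import Counter, defaultdict


def linkage_average_pmfs(pmfs):
    """Approximation (4.1): average d closed-form posteriors of N."""
    d = len(pmfs)
    averaged = defaultdict(float)
    for pmf in pmfs:
        for N, prob in pmf.items():
            averaged[N] += prob / d
    return dict(averaged)


def linkage_average_samples(samples):
    """Approximation (4.2): pool b draws of N from each of d posteriors."""
    b = len(samples[0])
    if any(len(draws) != b for draws in samples):
        raise ValueError("every linkage draw needs the same number b of draws of N")
    pooled = Counter()
    for draws in samples:
        pooled.update(draws)
    return {N: count / (len(samples) * b) for N, count in pooled.items()}
```

Requiring the same $b$ for every linkage draw keeps the pooled frequencies equal to an equally weighted average over $t=1,\dots,d$.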

The formal justification for this linkage-averaged posterior is given next.

4.3 Bayesian Justification of Linkage-Averaging

Our strategy for incorporating linkage uncertainty into population size estimation can be derived from a proper Bayesian analysis under two reasonable conditions.

Condition 1.

Our beliefs on $\mathbf{Z}$ are represented by the posterior distribution $p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ , coming from a model for record linkage and duplicate detection, composed by a likelihood function $\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ and a prior $p(\mathbf{Z})$ .

For our discussion, the linkage model can be one of the ones presented in Section 2, but we only require it to provide a proper posterior distribution on coreference partitions. For simplicity we use the notation $p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ and $\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ to represent models that either directly model the fields in the lists $\mathbf{X}$ or that use comparison data, although for the latter the notation $p_{\textsc{l}}(\mathbf{Z}\mid\boldsymbol{\Gamma}(\mathbf{X}))$ and $\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\boldsymbol{\Gamma}(\mathbf{X}))$ would be more appropriate, with $\boldsymbol{\Gamma}(\mathbf{X})$ representing the comparison data built from the records in $\mathbf{X}$ .

Condition 2.

If we knew the true value of $\mathbf{Z}$ , our beliefs on the population size $N$ would be represented by the posterior distribution $p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))$ , obtained from a capture-recapture model composed by a likelihood function $\mathcal{L}_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))$ and a prior $p(N)$ .

This condition simply indicates how we would obtain inferences on $N$ if we knew which records were coreferent. Note that the likelihood function $\mathcal{L}_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))$ should come from a capture-recapture model that has the frequencies of the inclusion patterns $\boldsymbol{n}(\mathbf{Z})$ , or a function of them, as sufficient statistics, such as those discussed in Section 3. In particular, notice that the capture-recapture model could involve only a subset of the $K$ datafiles being linked, that is, it could depend on inclusion patterns only for a subset of the $K$ datafiles. This scenario could arise in cases where some of the datafiles being linked arise from collection mechanisms that make the assumptions of the capture-recapture model seem implausible, such as lists that target members of the population with a distinctive trait and therefore lead to zero probability of inclusion for individuals without the trait.

Given the setup of Conditions 1 and 2, it seems natural to compute

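$$p_{\textsc{la}}(N)=\sum_{\mathbf{Z}}p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))\,p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$$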

as a way of propagating the linkage uncertainty into population size estimation. We refer to $p_{\textsc{la}}(N)$ as the linkage-averaged population size posterior. Here, $p_{\textsc{la}}(N)$ corresponds to the expected posterior distribution of the population size, averaging with respect to the posterior distribution of the coreference partition. This procedure is intuitively appealing, $p_{\textsc{la}}(N)$ has a clear interpretation, and we now show that $p_{\textsc{la}}(N)$ also corresponds to a proper posterior distribution.

In principle, if we want to draw inferences jointly on $N$ and $\mathbf{Z}$ given $\mathbf{X}$ , we need to specify a joint prior $p(N,\mathbf{Z})$ . From Condition 2, we have that the distribution $p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))$ would contain our belief on the population size if $\mathbf{Z}$ were known. Similarly, from Condition 1 we have that the prior $p(\mathbf{Z})$ contains our prior beliefs on $\mathbf{Z}$ . Therefore, Conditions 1 and 2 imply the joint prior $p(N,\mathbf{Z})=p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p(\mathbf{Z})$ .

Theorem 4.1 (Bayesian propriety of linkage-averaged population size posterior).

*$p_{\textsc{la}}(N)$ is the marginal posterior distribution of $N$ under the likelihood of the linkage model $\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ and the joint prior $p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p(\mathbf{Z})$ .*

Proof. The joint posterior of $N$ and $\mathbf{Z}$ is $p(N,\mathbf{Z}\mid\mathbf{X})\propto\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p(\mathbf{Z})$ , where the inverse of the proportionality constant is $\sum_{\mathbf{Z}}\sum_{N}\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p(\mathbf{Z})=\sum_{\mathbf{Z}}\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})p(\mathbf{Z})$ , since $\sum_{N}p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))=1$ . Given that $p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})\propto\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})p(\mathbf{Z})$ with the inverse of the proportionality constant being $\sum_{\mathbf{Z}}\mathcal{L}_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})p(\mathbf{Z})$ , we can therefore write $p(N,\mathbf{Z}\mid\mathbf{X})=p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ . Then, $p(N\mid\mathbf{X})=\sum_{\mathbf{Z}}p(N,\mathbf{Z}\mid\mathbf{X})=\sum_{\mathbf{Z}}p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})=p_{\textsc{la}}(N)$ . $\Box$

Furthermore, the total variability represented by $p_{\textsc{la}}(N)$ can be decomposed as

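$$\text{Var}_{\textsc{la}}(N)=\text{Var}_{\textsc{l}}\{E_{\textsc{c}}[N\mid\boldsymbol{n}(\mathbf{Z})]\}+E_{\textsc{l}}\{\text{Var}_{\textsc{c}}[N\mid\boldsymbol{n}(\mathbf{Z})]\},\tag{4.3}$$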

where the first term on the right hand side can be seen as the contribution of the linkage uncertainty on the population size variability, and the second term summarizes the variability that is intrinsic to Bayesian approaches for estimating $N$ .

In practice, we generally will have to approximate $p_{\textsc{la}}(N)$ and the variance components in (4.3) using posterior draws from $p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ , as explained in Section 4.2 and Equations (4.1) and (4.2).
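The two variance components can be estimated directly from the Monte Carlo output, following the law of total variance. In the sketch below, `samples[t]` is assumed to hold the draws $N^{(1,t)},\dots,N^{(b,t)}$ for the $t$th linkage draw; the function name and the use of population (divide-by-$n$) variances are choices made for this illustration.

```python
from statistics import fmean, pvariance


def variance_decomposition(samples):
    """Split the variability of N into linkage and capture-recapture parts.

    Returns (between, within): the variance across linkage draws of the
    conditional posterior means, and the average conditional posterior variance.
    """
    means = [fmean(draws) for draws in samples]
    variances = [pvariance(draws) for draws in samples]
    between = pvariance(means)           # contribution of linkage uncertainty
    within = fmean(variances)            # variability intrinsic to estimating N
    return between, within


between, within = variance_decomposition([[1, 3], [5, 7]])
linkage_share = between / (between + within)
```

When every linkage draw contributes the same number of draws of $N$, the two components add up exactly to the population variance of the pooled draws, so `linkage_share` is the proportion of the total variability attributed to the linkage.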

4.4 Linkage and Capture-Recapture Model Uncertainty

The capture-recapture model used above could be, for example, an individual decomposable graphical model, as presented in Section 3.1, but it could also be the average of them as in Madigan and York (1997), in which case $p_{\textsc{c}}(N\mid\boldsymbol{n}(\mathbf{Z}))$ would be given by (3.6). In Section 5 we present an application of our linkage-averaging strategy using the model of Madigan and York (1997) and the Bayesian partitioning approach of Sadinle (2014). Under the methodology of Madigan and York (1997) we can in fact write the linkage-averaged posterior of $N$ as

[TABLE]

where $m$ ranges over the non-saturated decomposable graphical models for the contingency table, and

[TABLE]

with $\mathcal{L}_{m}(N\mid\boldsymbol{n}(\mathbf{Z}))=P(\boldsymbol{n}(\mathbf{Z})\mid N,m)$ as given by (3.1). Expression (4.4) explicitly shows the contribution of coreference uncertainty and capture-recapture model uncertainty on the overall population size posterior. In fact, we can decompose the overall posterior variance as follows

[TABLE]

where the first term of the sum can be directly attributed to linkage uncertainty, the second term to model uncertainty in the population size estimation stage, and the remaining variability is intrinsic to Bayesian approaches for estimating $N$ .

We note that in principle we could also average the models of Madigan and York (1997) with that of Manrique-Vallier (2016), or with any other that satisfies the conditions discussed in Section 3, but we do not pursue that here.

4.5 Implications for Model Exploration and Data Confidentiality Protection

The strategy presented here allows the linkage and the population size estimation to be carried out in two separate stages, while still leading to proper Bayesian inferences. This has important practical implications, as the linkage can be performed as if it was the final goal, the population size estimation is standard given each coreference partition, and the combination of the two stages through linkage-averaging is simple.

Regarding model exploration, in principle an analyst would have to obtain a new posterior $p(N,\mathbf{Z}\mid\mathbf{X})$ for each different capture-recapture model being considered. For example, the approaches of Tancredi and Liseo (2011) and Liseo and Tancredi (2011) rely on a specific capture-recapture model in the case of two lists. Under our approach, however, we can reuse the results from the linkage step to obtain different linkage-averaged estimates for each different capture-recapture model. Theorem 4.1 implies that each such linkage-averaged posterior corresponds to a proper posterior distribution. Not having to re-do the linkage for each different capture-recapture model is certainly an important practical advantage.

Our two-stage strategy also indicates that the linkage and the population size estimation can be done by different analysts. This is relevant in contexts where one needs to protect the confidentiality of the lists and the privacy of the individuals, given that the linkage can be carried out by a small trusted team, and then the linkage results, in the form of draws from $p_{\textsc{l}}(\mathbf{Z}\mid\mathbf{X})$ , can be transferred to subsequent analysts without having to reveal personally identifiable information used for the linkage.

5 Estimating Mortality Levels in the Salvadoran Civil War

A common goal in quantitative human rights research is to estimate the total number of civilian casualties that occurred during a war. For this purpose, multiple-systems estimation is frequently used after different lists of casualties are combined via record linkage techniques, but typically the linkage uncertainty is ignored. Lum, Price and Banks (2013) provide a comprehensive review of such applications. In this section we study a case from the civil war that the Central American republic of El Salvador endured between 1980 and 1991. Our goal is to combine three data sources on civilian killings that were collected by three different organizations, and then use those results to obtain different multiple-systems estimates of the total number of civilian killings. We focus on the region (departamento) of the capital city, San Salvador.

5.1 Description of the Datafiles

The first two datafiles that we consider contain reports on civilian killings collected during the civil war. The first data source was put in electronic form by the Los Angeles-based nongovernmental organization El Rescate, from reports that had been published periodically during the civil war by the project Tutela Legal of the Archdiocese of San Salvador (Howland, 2008). We refer to this source as El Rescate / Tutela Legal (ER-TL, 1364 records from San Salvador). The second data source comes from the Salvadoran Human Rights Commission (Comisión de Derechos Humanos de El Salvador — CDHES, 285 records from San Salvador), which directly collected testimonials on human rights violations between 1979 and 1991 (Ball, 2000). For both datafiles, the characteristics of their collection make us believe that they should contain only small amounts of duplication, if any (Sadinle, 2017).

The third datafile was collected by the United Nations Truth Commission for El Salvador (UNTC, 440 records from San Salvador), between 1992 and 1993, after the civil war ended (Commission on the Truth for El Salvador, 1993). Given that most of the reports to the UNTC refer to killings that occurred several years before 1992, it is reasonable to expect the information in this datafile to be less reliable compared with ER-TL and CDHES, since individuals reporting casualties might not have had accurate recollections of the time and place of the events. Non-trivial duplicates arise in this datafile from reports of multiple family members and acquaintances of a single victim.

5.2 Record Linkage and Duplicate Detection

5.2.1 Datafile Standardization, Filtering Non-Coreferent Pairs, and Comparison Data

The three datafiles used in this section have the following fields in common: given and family names, date and place/municipality of death. Our standardization of names and construction of comparison data are as described in the application of Sadinle (2014). Table 1 summarizes the construction of levels of disagreement. Since the datafiles are small enough, we computed comparison data for all $\binom{2,089}{2}=$ 2,180,916 record pairs (the set $\mathcal{P}$ ). We then formed the set $\mathcal{C}$ of candidate coreferent pairs by fixing as non-coreferent all the pairs that have disagreement level three in either given name or family name. This leads to only $|\mathcal{C}|=699$ candidate pairs, which involve only 775 records.
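The filtering step can be sketched as below. The callable `disagreement_level(i, j, field)` is hypothetical: it stands in for the construction summarized in Table 1, whose similarity thresholds are not reproduced here, and the field names are placeholders.

```python
from itertools import combinations
from math import comb


def candidate_pairs(n_records, disagreement_level):
    """Keep the pairs not fixed as non-coreferent.

    A pair is fixed as non-coreferent when it has disagreement level three
    in either given name or family name.
    """
    return [
        (i, j)
        for i, j in combinations(range(n_records), 2)
        if disagreement_level(i, j, "given_name") < 3
        and disagreement_level(i, j, "family_name") < 3
    ]


# Number of record pairs among the 1364 + 285 + 440 records.
total_pairs = comb(1364 + 285 + 440, 2)
```

Only the pairs returned by `candidate_pairs` need to be considered by the sampler; all others keep their fixed non-coreferent status throughout.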

5.2.2 Prior Specification

We followed the general guidelines presented in Section 2.2 and used uniform priors on $[0,1]$ for all the $u_{fl}$ parameters. For the $m_{fl}$ parameters, we used flat priors in the intervals $[\lambda_{fl},1]$ for the truncation points given in Table 2. These priors indicate our belief that coreferent pairs are very likely to have exact agreements, although we still expect a considerable amount of error in the fields. Finally, the prior for the field day of death has low truncation points in general, since we believe this field to be unreliable.

5.2.3 Gibbs Sampler Implementation

We ran 10,000 iterations of the Gibbs sampler of Sadinle (2014). The runtime, using an implementation in R with parts written in C, was 35 seconds on a laptop with an Intel Core i7-4900MQ processor. Convergence of the chain was checked using functions of the partitions. We found the number of killings reported 1, 2, and 3 times according to each partition in the chain. The traceplots of these chains (not shown here) indicate that they converged rather quickly, and their autocorrelation functions show no large autocorrelations in the chain. Similar results were obtained when we explored the number of different killings in the datafiles according to the partitions in the chain. Based on these diagnostics we discarded the first 1,000 iterations and kept one draw every five iterations. After this thinning, the autocorrelation plots (not shown here) did not suggest the existence of remaining autocorrelations of any order. For each of the previously explored chains we also computed Geweke’s convergence diagnostic as implemented in the R package coda (Plummer et al., 2006). The Geweke Z-scores indicated that it is reasonable to treat these chains as drawn from their stationary distributions. We also explored the marginal probabilities that pairs of records are coreferent for the pairs in the set $\mathcal{C}$ of candidate pairs. For each pair in $\mathcal{C}$ , and for each partition in the chain, we checked whether the pair appeared together in the partition. For each of these binary chains we computed Geweke’s convergence diagnostic, and we found that all the Z-scores fall within the usual range of a standard normal random variable, which indicates that it is reasonable to assume that these chains were obtained from their stationary distributions.
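The burn-in, thinning, and pairwise summaries described above can be sketched as follows. The chain is assumed to be stored as a list of labelings, one per iteration; the function names are hypothetical, and which iteration each retained draw corresponds to is an assumption of this example.

```python
def burn_and_thin(chain, burn_in, step):
    """Discard the first `burn_in` iterations, then keep one draw every `step`."""
    return chain[burn_in::step]


def coreference_indicators(partitions, i, j):
    """Binary chain indicating whether records i and j share a label."""
    return [1 if Z[i] == Z[j] else 0 for Z in partitions]


def coreference_probability(partitions, i, j):
    """Estimated posterior probability that records i and j are coreferent."""
    indicators = coreference_indicators(partitions, i, j)
    return sum(indicators) / len(indicators)


# With 10,000 iterations, a burn-in of 1,000, and thinning by five:
n_kept = len(burn_and_thin(list(range(10_000)), burn_in=1_000, step=5))
```

The binary chains returned by `coreference_indicators` are the kind of input on which a convergence diagnostic would then be computed for each candidate pair.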

5.3 Linkage-Averaged Posterior Estimates of the Total Number of Killings

The draws from the posterior of the coreference partition can be directly used to obtain inferences on different quantities of interest. For example, computing the size of each partition gives us posterior draws of the number of different reported killings, which in this case lead to a 99% credible interval of [1892, 1906], and a posterior mean of 1900. This can be seen as an estimated lower bound on the total number of killings. In Table 3 we also present the marginal posterior distribution of number of killings following each of the different inclusion patterns, $n_{111},\dots,n_{100}$ . The remainder of the section is devoted to using the posterior draws of the coreference partition to derive estimates of the total number of killings using different capture-recapture models.

5.3.1 Two-Sample Estimates

In Section 4 we mentioned that the subsequent capture-recapture model does not necessarily have to use all the lists combined in the linkage step. For example, the linkage step may have included datafiles whose collection make the assumptions in the capture-recapture model implausible. An example in the context of our application would be a list of the victims that belonged to a given organization; in that case, non-members of the organization would have zero probability of being included in the list, by definition.

In this section, we use the results of the linkage step to derive estimates of the population size based only on the inclusion patterns for pairs of lists. For example, using only the first two data sources to estimate the population size, we need to compute $n_{11+}(\mathbf{Z})=n_{110}(\mathbf{Z})+n_{111}(\mathbf{Z})$ , $n_{10+}(\mathbf{Z})=n_{100}(\mathbf{Z})+n_{101}(\mathbf{Z})$ , and $n_{01+}(\mathbf{Z})=n_{010}(\mathbf{Z})+n_{011}(\mathbf{Z})$ , for each coreference partition labeling $\mathbf{Z}$ in the posterior sample from the linkage step, and use these as sufficient statistics for the capture-recapture model. This modeling approach does not take advantage of the additional piece of information $n_{001}(\mathbf{Z})$ . With only two sources, we are limited to the capture-recapture model that assumes independence of the inclusion of the victims in the data sources, as the counts $n_{11+}$ , $n_{10+}$ , and $n_{01+}$ do not contain enough information to estimate this dependence. A possible alternative would be to pre-specify a degree of dependence between the sources, for example as discussed in Ericksen, Kadane and Tukey (1989), but we do not pursue that avenue here. In the model of independence, the modeling approach of Madigan and York (1997) corresponds to the approach of Castledine (1981).
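Collapsing a three-list table onto a pair of lists is a marginalization over the list that is left out, as in the following sketch. The dictionary layout, with inclusion patterns as tuples, and the function name are assumptions made for illustration.

```python
from collections import Counter


def collapse_to_lists(table, keep):
    """Sum inclusion-pattern counts over the lists that are not in `keep`.

    `table` maps patterns (h_1, ..., h_K) to counts and `keep` holds the
    zero-based indices of the lists retained. Individuals who appear in none
    of the retained lists are dropped, since they are unobserved for the
    reduced capture-recapture model.
    """
    collapsed = Counter()
    for h, count in table.items():
        sub = tuple(h[k] for k in keep)
        if any(sub):
            collapsed[sub] += count
    return collapsed


three_lists = {
    (1, 1, 1): 2, (1, 1, 0): 3, (1, 0, 1): 1, (1, 0, 0): 5,
    (0, 1, 1): 2, (0, 1, 0): 4, (0, 0, 1): 7,
}
first_two = collapse_to_lists(three_lists, keep=(0, 1))
```

In this toy table the seven individuals with pattern $(0,0,1)$ are discarded when only the first two lists are used, which is the information loss noted above for $n_{001}(\mathbf{Z})$.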

In Table 4 we present summaries of each linkage-averaged posterior of $N$ obtained using the different possible pairs of datafiles for the estimation of the population size $N$ . The fifth column in that table shows the percentage contribution of the linkage variability towards the overall posterior variability of the population size, derived from (4.3). For some of the models this contribution can be quite small, meaning that in such cases obtaining a posterior estimate of the inclusion patterns’ frequencies and fixing those to estimate $N$ would lead to similar inferences compared with those from linkage-averaging. However, we only obtain this information after we compute the variance decomposition in (4.3).

5.3.2 Three-Sample Estimates from Individual Graphical Models

We now obtain estimates of $N$ based on each of the individual graphical models presented in Section 3.1. Fixing one such model to estimate $N$ could arise in a context where one can conjecture the dependence graph based on domain knowledge, such as knowledge of collaboration, affinity, or antagonism between institutions collecting the data.

We summarize the linkage-averaged posteriors obtained using each individual graphical model in Table 5. As with the two-sample estimates, we can see that the relative contribution of the linkage uncertainty towards the posterior uncertainty around $N$ can be quite small, meaning that the importance of accounting for linkage uncertainty ends up depending on the specific model. Unfortunately, there does not seem to be a way to tell in advance whether the linkage uncertainty is going to have a big impact on the estimation of $N$ .

5.3.3 Three-Sample Estimates from Madigan and York (1997)

We now use the Bayesian model averaging approach of Madigan and York (1997) to estimate $N$ . For each coreference partition $\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(d)}$ , we can compute the joint posterior probability of the graphical model $m$ and the population size $N$ , $p(m,N\mid\mathbf{X},\mathbf{Z}^{(t)})$ , which we can use to derive $p(N\mid\mathbf{X},\mathbf{Z}^{(t)})$ . The gray lines in the first panel of Figure 2 represent each $p(N\mid\mathbf{X},\mathbf{Z}^{(t)})$ for $t=1,\dots,100$ , and the black line represents the linkage-averaged posterior of $N$ . The posteriors of the number of killings derived from the individual draws $\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(100)}$ are somewhat similar to each other, which indicates a small contribution of the linkage uncertainty towards the overall posterior variability of $N$ . According to the variance decomposition in (4.5), in this case 12% of the posterior variability is due to uncertainty in duplicate detection and record linkage.

The second panel in Figure 2 shows the linkage-averaged posterior of $N$ along with $p_{\textsc{la}}(m,N\mid\mathbf{X})$ , obtained from averaging $p(m,N\mid\mathbf{X},\mathbf{Z}^{(t)})$ over the posterior draws $\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(100)}$ , for the three models $m$ that have linkage-averaged posterior probabilities $p_{\textsc{la}}(m\mid\mathbf{X})>0.05$ . Denoting 1: ER-TL, 2: CDHES, and 3: UNTC, we find that the posteriors of $N$ under the models [1,3][2], [1,2][2,3], and [1,3][2,3] are concentrated around different values of $N$ , which greatly increases the posterior variability of $N$ . In fact, the variance decomposition in (4.5) tells us that in this case 77% of the posterior variability of $N$ is due to uncertainty on the graphical model for population size estimation. This seems to indicate that as long as we have a good estimate of the contingency table of inclusion patterns, ignoring the linkage uncertainty in the population size estimation would not be too harmful, at least for this application. The linkage-averaging approach leads to a posterior mean of 13,432, and a 99% credible interval of [5627, 25404].

5.3.4 Three-Sample Estimates from Manrique-Vallier (2016)

Linkage-averaging for population size estimation can be used with any Bayesian partitioning approach to record linkage and duplicate detection, and any model for population size estimation that depends only on the capture histories’ frequencies of the individuals in the lists. We now use the linkage results described in Section 5.2 obtained from the approach of Sadinle (2014), along with the population size methodology of Manrique-Vallier (2016).
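To make concrete what it means for a model to depend only on the capture histories' frequencies, the following Python sketch derives those frequencies from a single partition draw. The function name `capture_history_counts`, the label-vector encoding of the partition, and the toy records are illustrative assumptions, not part of the paper or of any package.

```python
from collections import Counter

def capture_history_counts(z, source, n_lists=3):
    """Count linked individuals by inclusion pattern (hypothetical helper).

    z[i]      -- cluster label of record i; records sharing a label are coreferent
    source[i] -- 0-based index of the list that record i comes from
    Returns a Counter mapping patterns such as (1, 0, 1) to frequencies.
    """
    lists_per_cluster = {}
    for label, s in zip(z, source):
        # Duplicates within a list collapse because we store a set of lists.
        lists_per_cluster.setdefault(label, set()).add(s)
    counts = Counter()
    for lists in lists_per_cluster.values():
        pattern = tuple(int(k in lists) for k in range(n_lists))
        counts[pattern] += 1
    return counts

# Toy example (made-up data): six records from three lists, one partition draw.
z = [0, 0, 1, 2, 2, 2]
source = [0, 1, 0, 0, 1, 2]
counts = capture_history_counts(z, source)
```

Repeating this for every posterior draw of the partition gives one table of inclusion-pattern frequencies per draw, which is the only input the population size model needs.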

For each of 100 draws $\mathbf{Z}^{(1)},\dots,\mathbf{Z}^{(100)}$, we obtained an MCMC sample $N^{(1,t)},\dots,N^{(20000,t)}\sim p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)}))$, $t=1,\dots,100$, from the posterior obtained under the model of Manrique-Vallier (2016) using the MCMC implementation of the R package LCMCR. We then used the approximation (4.2) of the linkage-averaged posterior of $N$. Figure 3 presents an approximation of each of $p(N\mid\boldsymbol{n}(\mathbf{Z}^{(t)}))$, $t=1,\dots,100$, and the approximate linkage-averaged posterior of $N$, $p_{\textsc{la}}(N\mid\mathbf{X})$. Under this approach we obtain a posterior 99% credible interval of [4922, 31429] and a posterior mean of 13,924. The contribution of the linkage uncertainty to the overall posterior variability is estimated at only 6.3%.
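The summaries in this paragraph can be sketched in Python, assuming that the approximation (4.2) amounts to an equal-weight average of the per-partition posteriors, so that equally sized MCMC samples can simply be pooled. The function name `linkage_averaged_summary` is hypothetical, and the draws below are synthetic numbers generated for illustration only; they are not the Salvadoran results.

```python
import numpy as np

def linkage_averaged_summary(draws, level=0.99):
    """Summarize pooled draws of N (hypothetical helper).

    draws -- array of shape (T, S); row t holds S posterior draws of N
             obtained conditional on the partition draw Z^(t).
    """
    draws = np.asarray(draws, dtype=float)
    pooled = draws.ravel()                  # equal-weight mixture over t
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(pooled, [alpha, 1.0 - alpha])
    within = draws.var(axis=1).mean()       # average variance given a partition
    between = draws.mean(axis=1).var()      # variance of the per-partition means
    return {
        "mean": pooled.mean(),
        "interval": (lo, hi),
        "linkage_share": between / (within + between),
    }

# Synthetic stand-in for 100 partition draws with 200 MCMC draws each.
rng = np.random.default_rng(0)
centers = rng.normal(13000.0, 500.0, size=100)
draws = centers[:, None] + rng.normal(0.0, 3000.0, size=(100, 200))
summary = linkage_averaged_summary(draws)
```

With equally sized samples and population variances, the within and between terms add up exactly to the variance of the pooled draws, so `linkage_share` is the fraction of the pooled variance attributable to differences between partition draws.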

5.3.5 Estimates Using Mixture-Model Approach to Record Linkage

We finally present the results obtained using a more traditional mixture model approach to record linkage (e.g., Fellegi and Sunter, 1969; Winkler, 1988; Jaro, 1989; Larsen and Rubin, 2001; Elmagarmid, Ipeirotis and Verykios, 2007; Herzog, Scheuren and Winkler, 2007). Such models output independent pairwise coreference decisions. We implemented a mixture model version of the model of Sadinle (2014) as presented in Section 5.2. This approach classifies the record pairs in $\mathcal{C}$ into coreferent and non-coreferent pairs. The mixture model is obtained by ignoring that the match status of a record pair is given by $M_{ij}=I(Z_{i}=Z_{j})$, and simply taking $M_{ij}\mid p\overset{iid}{\sim}\text{Bernoulli}(p)$, $i<j$. We used Bayesian estimation of this mixture model employing the same priors for the $m_{fl}$ and $u_{fl}$ parameters as in the application to the Salvadoran lists, and $p\sim\text{Uniform}(0,1)$. From running a Gibbs sampler for 10,000 iterations, we obtained a posterior sample of $\{M_{ij}\}_{(i,j)\in\mathcal{C}}$.

To obtain groups of coreferent records we used transitive closure. In the mixture model explained above, each iteration of the Gibbs sampler yields a draw of $\{M_{ij}\}_{(i,j)\in\mathcal{C}}$. For each of these draws we apply transitive closure by setting $M_{jj^{\prime}}=1$ if $M_{ij}=M_{ij^{\prime}}=1$ for any record $i$. The number of non-transitive triplets $(i,j,j^{\prime})$, where only two of $M_{ij}$, $M_{ij^{\prime}}$, $M_{jj^{\prime}}$ are equal to 1, varies between 84 and 156 across the Gibbs iterations, which is not surprising given that this model treats the $M_{ij}$’s as independent. Using transitive closure we obtain an ad-hoc distribution over partitions of the records, which we can use to implement an ad-hoc version of the linkage-averaged estimate of $N$. We then computed linkage-averaged posteriors using the models of Madigan and York (1997) and Manrique-Vallier (2016), which lead to posterior means of 15,636 and 14,999, and 99% credible intervals of [6389, 25532] and [5806, 35326], respectively. In this case the ad-hoc mixture model for record linkage leads to results similar to those obtained using the method of Sadinle (2014). This can be explained by the fact that both models are essentially the same, except that one samples pairwise matching statuses and the other samples coreference partitions. Also, the graph induced by the set of candidate coreferent pairs $\mathcal{C}$ is quite sparse and broken into many small connected components, which constrains the clustering effect of transitive closure: it can only group records in the same connected component obtained from $\mathcal{C}$.
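The transitive closure step can be sketched in Python with a union-find structure. This is a minimal illustration under the assumption that one Gibbs draw is available as a list of record pairs with $M_{ij}=1$; the function name `transitive_closure` and the toy pairs are hypothetical and not taken from the paper's implementation.

```python
def transitive_closure(n_records, matched_pairs):
    """Return cluster labels implied by pairwise matches (hypothetical helper).

    n_records     -- total number of records, indexed 0..n_records-1
    matched_pairs -- iterable of (i, j) pairs with match status equal to 1
    Two records get the same label if a chain of matches connects them.
    """
    parent = list(range(n_records))

    def find(i):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matched_pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            # Keep the smaller index as the root so labels are deterministic.
            parent[max(ri, rj)] = min(ri, rj)
    return [find(i) for i in range(n_records)]

# Toy draw: records 0-1 and 1-2 are matched, so 0 and 2 are linked by closure.
matched = [(0, 1), (1, 2), (3, 4)]
labels = transitive_closure(6, matched)
```

Because only pairs in $\mathcal{C}$ can carry a match, the resulting clusters never extend beyond a connected component of the candidate-pair graph, which is the constraint noted above. Applying the function to each Gibbs draw yields one partition per draw.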

5.4 Discussion

We presented linkage-averaged estimates under individual graphical models, and linkage-averaged two-sample estimates under independence of the list inclusion indicators. These approaches lead to widely different estimates, but we simply presented them to illustrate the possibilities of linkage-averaging. The linkage-averaged estimates obtained under the models of Madigan and York (1997) and Manrique-Vallier (2016) are more plausible, as they each take into account the uncertainty about the correct model for population size estimation.

While the same linkage results, in the form of posterior draws of the coreference partition, were used for obtaining all linkage-averaged estimates, the percentage contribution of the linkage uncertainty to the overall uncertainty of $N$ varies with the capture-recapture model. For some of these approaches the contribution from the linkage is rather small, but we can only measure this after we have computed the linkage-averaged estimates.

The linkage-averaged posteriors using the models of Madigan and York (1997) and Manrique-Vallier (2016) lead to roughly the same point estimates: 13,432 and 13,924 civilian killings, respectively, in the region of San Salvador during the Salvadoran civil war. The linkage-averaged posteriors themselves, however, disagree in the tails. The disagreement on the right tail can be explained to some extent when we consider that the prior for $N$ used with the approach of Madigan and York (1997) was truncated at 30,000, whereas we did not use this truncation with the approach of Manrique-Vallier (2016) as the implementation of the R package LCMCR does not allow it. The results using Madigan and York (1997) can therefore be seen as somewhat conservative.

6 Conclusions

We presented a linkage-averaging approach to incorporate linkage uncertainty into models for population size estimation. We used Bayesian partitioning approaches for record linkage which provide posterior distributions on the coreference partition of the records coming from all the data sources. The models for population size estimation covered by our approach are those whose sufficient statistics are functions of the coreference partition alone. Under these conditions, linkage-averaging is proper in the sense that it can be derived from a proper Bayesian analysis that combines the record linkage and population size estimation models. It is important to note, however, that the success of this approach is determined by the success of its components. For example, if the record linkage model over-links or under-links, then the population size estimates will be lower or higher, respectively, with respect to what we would obtain under the correct linkage. Similarly, if the model for population size estimation is wrong, our estimates will be deficient regardless of the amount of uncertainty from the linkage stage.

The class of capture-recapture models considered here is somewhat restrictive given that, for example, they do not allow us to incorporate information on covariates that may influence capture probabilities. Traditionally, a simple way of dealing with heterogeneous inclusion probabilities in multiple-systems estimation is to stratify by characteristics that influence the inclusion probabilities, such as space and/or time. To use linkage-averaging to produce population size estimates per stratum (say, year $\times$ region) we would have to assume that the stratifying variables are recorded without error, which might be unreasonable in the context of the datafiles from El Salvador. For example, suppose two records that disagree in the stratum where they belong are coreferent according to a coreference partition. Our current methodology does not offer a way of allocating this individual to a unique stratum, nor a way to deal with the uncertainty on where it should be allocated. However, if the stratifying variables can also be used as blocking variables in the linkage step, then the linkage-averaging approach enjoys Bayesian propriety within each stratum. In this sense, approaches such as those of Steorts, Hall and Fienberg (2016), Tancredi and Liseo (2011), and Liseo and Tancredi (2011) that directly model the information in the datafiles seem promising, given that they explicitly allow us to estimate the latent true values of the individuals in the files.

We also presented an application to the combination of three lists on civilian killings from the civil war of El Salvador. In this case, the intrinsic variability of Bayesian population size estimation is much larger than the uncertainty coming from the linkage stage, but this might be different in other applications. Our analyses of the lists from El Salvador indicate that the number of civilian killings during the Salvadoran civil war in the region of San Salvador is most likely to be around 13,000–14,000, but the variability in these estimates is quite large, leading to a posterior 99% credible interval of [4922, 31429], according to the linkage-averaged estimates obtained using the methodology of Manrique-Vallier (2016). Unfortunately, we do not have a way of validating these results, as there does not even exist ground truth for validating the linkage of these datafiles.

Acknowledgements

This research is derived from the Ph.D. thesis of the author, supervised by Stephen E. Fienberg. Steve’s many interests included record linkage, population size estimation, and their application to human rights. The author therefore dedicates this article to the memory of Steve; without his support this research would not have been possible.

The author also thanks Patrick Ball and Megan Price from the Human Rights Data Analysis Group – HRDAG for providing access to the data used in this article, and Daniel Manrique-Vallier, Kristian Lum, Robin Mejia, Trivellore Raghunathan, and Thomas Brendan Murphy for helpful comments that contributed to improving the quality of this article.

Bibliography

1. Anderson, M. J. and Fienberg, S. E. (1999). Who Counts?: The Politics of Census-Taking in Contemporary America, Revised paperback (2001) ed. Russell Sage Foundation, New York.
2. Ball, P. (2000). The Salvadoran Human Rights Commission: Data Processing, Data Representation, and Generating Analytical Reports. In Making the Case: Investigating Large Scale Human Rights Violations Using Information Systems and Data Analysis (P. Ball, H. F. Spirer and L. …
3. Bilenko, M., Mooney, R. J., Cohen, W. W., Ravikumar, P. and Fienberg, S. E. (2003). Adaptive Name Matching in Information Integration. IEEE Intelligent Systems 18 16–23.
4. Bird, S. M. and King, R. (2018). Multiple Systems Estimation (or Capture-Recapture Estimation) to Inform Public Policy. Annual Review of Statistics and Its Application 5.
5. Bishop, Y. M., Fienberg, S. E. and Holland, P. W. (1975). Discrete Multivariate Analysis: Theory and Practice. The MIT Press. Reprinted in 2007 by Springer, New York.
6. Castledine, B. J. (1981). A Bayesian Analysis of Multiple-Recapture Sampling for a Closed Population. Biometrika 68 197–210.
7. Christen, P. (2012). A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication. IEEE Transactions on Knowledge and Data Engineering 24 1537–1555.
8. Dawid, A. P. and Lauritzen, S. L. (1993). Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models. Annals of Statistics 21 1272–1317.
