On the estimation of population size from a post-stratified two sample capture-recapture data under dependence

Kiranmoy Chatterjee; Prajamitra Bhuyan

arXiv:1703.03022·stat.ME·January 21, 2019

TL;DR

This paper introduces a new model for population size estimation using two sample capture-recapture data that accounts for dependency between capture and recapture, improving accuracy over traditional methods.

Contribution

The paper proposes a novel model that incorporates dependency in capture-recapture data and develops estimation methods for this model, addressing a gap in existing literature.

Findings

01

Proposed model outperforms existing methods in simulations.

02

Method effectively captures dependency between capture and recapture.

03

Illustrated with real data analysis.

Abstract

Population size estimation based on two sample capture-recapture type experiments is an interesting problem in various fields including epidemiology, public health, population studies, etc. The Lincoln-Petersen estimate is popularly used under the assumption that the capture and recapture statuses of each individual are independent. However, in many real life scenarios, there is an inherent dependency between capture and recapture attempts which is not well studied in the literature on the dual system or two sample capture-recapture method. In this article, we propose a novel model that successfully incorporates the possible causal dependency and provide corresponding estimation methodologies for the associated model parameters based on post-stratified two sample capture-recapture data. The superiority of the performance of the proposed model over the existing competitors is established through…

Tables

Table 1: Dual-record System (DRS): $2\times 2$ data structure with cell probabilities given in [ ] and $p_{\cdot\cdot}=1$.

	List 2
List 1	In	Out	Total
In	$x_{11}$ [$p_{11}$]	$x_{10}$ [$p_{10}$]	$x_{1\cdot}$ [$p_{1\cdot}$]
Out	$x_{01}$ [$p_{01}$]	$x_{00}$ [$p_{00}$]	$x_{0\cdot}$ [$p_{0\cdot}$]
Total	$x_{\cdot 1}$ [$p_{\cdot 1}$]	$x_{\cdot 0}$ [$p_{\cdot 0}$]	$x_{\cdot\cdot}=N$ [$p_{\cdot\cdot}$]

Table 2: Summary results on the estimators of $N_{A}$ under the simulation model Model I with $N_{A}=240$ when the ratio of the sub-population sizes ($r$) is unknown.

Population	$\alpha_{A}$	Method	RB	RRMSE	CP (%)	LCI
P1	0.4	MLE	0.0370	0.0632	90	84.47
		MME	0.0684	0.0843	99	107.75
		Nour	0.1669	0.1703	-	-
	0.8	MLE	0.0361	0.0628	99.5	98.84
		MME	0.0650	0.0813	99	99.55
		Nour	0.3377	0.3389	-	-
P2	0.4	MLE	0.0420	0.0647	88.5	95.53
		MME	0.0747	0.0940	96.5	111.58
		Nour	0.1628	0.1660	-	-
	0.8	MLE	0.0413	0.0660	100	99.34
		MME	0.0719	0.0898	98.5	102.39
		Nour	0.3346	0.3358	-	-
P3	0.4	MLE	0.0401	0.0608	85	45.62
		MME	0.0497	0.0634	94.5	76.87
		Nour	0.0847	0.0886	-	-
	0.8	MLE	0.0367	0.0911	92.7	80.89
		MME	0.0475	0.0594	98	71.56
		Nour	0.1701	0.1717	-	-
P4	0.4	MLE	0.0318	0.0495	90	43.18
		MME	0.0449	0.0582	25	73.64
		Nour	0.0822	0.0861	-	-
	0.8	MLE	0.0291	0.0644	88.89	77.53
		MME	0.0400	0.0519	97.5	68.79
		Nour	0.1687	0.1703	-	-
P5	0.4	MLE	0.0391	0.0701	92	102.32
		MME	0.0872	0.1093	79	131.05
		Nour	0.2065	0.2100	-	-
	0.8	MLE	0.0401	0.0724	99	136.63
		MME	0.0791	0.1040	97.5	127.81
		Nour	0.4174	0.4187	-	-
P6	0.4	MLE	0.0377	0.0754	90	101.78
		MME	0.0877	0.1131	96.5	139.50
		Nour	0.2135	0.2169	-	-
	0.8	MLE	0.0487	0.0681	99	141.74
		MME	0.0868	0.1099	98	136.50
		Nour	0.4223	0.4236	-	-

Table 3: Summary results on the estimators of $N_{A}$ under the simulation model Model I with $N_{A}=1200$ when the ratio of the sub-population sizes ($r$) is unknown.

Population	$\alpha_{A}$	Method	RB	RRMSE	CP (%)	LCI
P1	0.4	MLE	0.0002	0.0153	99	295.07
		MME	-0.0002	0.0381	99	219.52
		Nour	-0.1661	0.1669	-	-
	0.8	MLE	0.0005	0.0182	99	160.94
		MME	-0.0019	0.0382	98	221.31
		Nour	-0.3371	0.3373	-	-
P2	0.4	MLE	0.0017	0.0138	99.5	412.65
		MME	0.0035	0.0388	98.5	223.56
		Nour	-0.1626	0.1633	-	-
	0.8	MLE	0.0002	0.0161	99.5	165.51
		MME	0.0019	0.0388	99.5	227.80
		Nour	-0.3339	0.3342	-	-
P3	0.4	MLE	0.0027	0.0144	99.5	158.39
		MME	0.0021	0.0262	98	154.01
		Nour	-0.0851	0.0860	-	-
	0.8	MLE	0.0017	0.0125	100	182.97
		MME	0.0008	0.0264	98.5	159.08
		Nour	-0.1723	0.1726	-	-
P4	0.4	MLE	0.0009	0.0089	100	233.09
		MME	0.0025	0.0265	97	151.14
		Nour	-0.0825	0.0834	-	-
	0.8	MLE	0.0012	0.0089	100	291.15
		MME	0.0011	0.0261	97.5	152.76
		Nour	-0.1705	0.1709	-	-
P5	0.4	MLE	0.0008	0.0098	100	226.03
		MME	-0.0009	0.0466	98.5	263.76
		Nour	-0.2064	0.2071	-	-
	0.8	MLE	0.0002	0.0180	100	286.80
		MME	-0.0006	0.0448	99	265.13
		Nour	-0.4208	0.4210	-	-
P6	0.4	MLE	-0.0002	0.0195	99.5	309.30
		MME	0.0026	0.0505	99	276.84
		Nour	-0.2128	0.2135	-	-
	0.8	MLE	0.0002	0.0218	99.5	378.88
		MME	0.0030	0.0508	99.5	277.64
		Nour	-0.4248	0.4250	-	-

Table 4: Summary results on the estimators of $N_{A}$ and $N_{B}$ under the simulation model Model II with $(N_{A},N_{B})=(240,200)$ when the ratio of the sub-population sizes ($r$) is unknown.

Population	$\alpha_{0}$	Method	RB	RRMSE	CP (%)	LCI
Results on estimators of $N_{A}$
P1	0.4	MLE	0.0028	0.0265	96.5	27.84
		Nour	-0.1692	0.1728	-	-
	0.8	MLE	0.0073	0.0401	96	44.36
		Nour	-0.3382	0.3398	-	-
P2	0.4	MLE	0.0034	0.0300	95.5	27.22
		Nour	-0.1659	0.1691	-	-
	0.8	MLE	0.0067	0.0424	96	42.23
		Nour	-0.3350	0.3365	-	-
P3	0.4	MLE	0.0108	0.0425	92.5	40.81
		Nour	-0.0883	0.0920	-	-
	0.8	MLE	0.0288	0.0742	86.5	57.93
		Nour	-0.1721	0.1741	-	-
P4	0.4	MLE	0.0072	0.0379	95	37.11
		Nour	-0.0857	0.0893	-	-
	0.8	MLE	0.0208	0.0603	92	55.67
		Nour	-0.1704	0.1724	-	-
P5	0.4	MLE	0.0040	0.0291	95.5	31.02
		Nour	-0.2066	0.2103	-	-
	0.8	MLE	0.0126	0.0473	96.5	51.26
		Nour	-0.4188	0.4199	-	-
P6	0.4	MLE	0.0053	0.0328	93.5	32.54
		Nour	-0.2117	0.2157	-	-
	0.8	MLE	0.0087	0.0419	97	48.92
		Nour	-0.4225	0.4236	-	-
Results on estimators of $N_{B}$
P1	0.4	MLE	0.0302	0.0374	94.5	30.61
		Nour	-0.1603	0.1639	-	-
	0.8	MLE	0.0103	0.0505	95	44.01
		Nour	-0.3305	0.3321	-	-
P2	0.4	MLE	0.0076	0.0386	93	29.65
		Nour	-0.1656	0.1699	-	-
	0.8	MLE	0.0095	0.0491	97	44.00
		Nour	-0.3367	0.3383	-	-
P3	0.4	MLE	0.0166	0.0501	93	38.01
		Nour	-0.0854	0.0908	-	-
	0.8	MLE	0.0314	0.0761	88.5	52.43
		Nour	-0.1730	0.1746	-	-
P4	0.4	MLE	0.0120	0.0407	91.5	32.28
		Nour	-0.0795	0.0839	-	-
	0.8	MLE	0.0229	0.0647	92.5	49.06
		Nour	-0.1678	0.1695	-	-
P5	0.4	MLE	0.0042	0.0438	95.5	35.63
		Nour	-0.2014	0.2059	-	-
	0.8	MLE	0.0120	0.0656	93	52.44
		Nour	-0.41283	0.4144	-	-
P6	0.4	MLE	0.0029	0.0456	97	37.18
		Nour	-0.2223	0.2262	-	-
	0.8	MLE	0.0101	0.0604	95.5	51.97
		Nour	-0.42544	0.4270	-	-

Table 5: Summary results on the estimators of $N_{A}$ and $N_{B}$ under the simulation model Model II with $(N_{A},N_{B})=(1200,1000)$ when the ratio of the sub-population sizes ($r$) is unknown.

Population	$\alpha_{0}$	Method	RB	RRMSE	CP (%)	LCI
Results on estimators of $N_{A}$
P1	0.4	MLE	0.0004	0.0097	96	46.14
		Nour	-0.1648	0.1654	-	-
	0.8	MLE	0.0001	0.0127	93.5	61.19
		Nour	-0.3371	0.3374	-	-
P2	0.4	MLE	0.0008	0.0111	95	52.49
		Nour	-0.1615	0.1621	-	-
	0.8	MLE	0.0001	0.0135	93.5	64.33
		Nour	-0.3337	0.3341	-	-
P3	0.4	MLE	0.0001	0.0069	93.5	33.13
		Nour	-0.0849	0.0856	-	-
	0.8	MLE	0.0010	0.0088	94.5	41.02
		Nour	-0.1709	0.1714	-	-
P4	0.4	MLE	0.0001	0.0063	94.5	30.10
		Nour	-0.0825	0.0831	-	-
	0.8	MLE	0.0009	0.0084	95	38.98
		Nour	-0.1691	0.1695	-	-
P5	0.4	MLE	-0.0011	0.0121	95	57.61
		Nour	-0.2084	0.2091	-	-
	0.8	MLE	0.0001	0.0157	94.5	75.60
		Nour	-0.4210	0.4212	-	-
P6	0.4	MLE	-0.0012	0.0139	94.5	66.93
		Nour	-0.2149	0.2157	-	-
	0.8	MLE	0.00021	0.0168	95	80.66
		Nour	-0.4253	0.4255	-	-
Results on estimators of $N_{B}$
P1	0.4	MLE	-0.0001	0.0150	95	58.85
		Nour	-0.1603	0.1611	-	-
	0.8	MLE	0.0003	0.0187	95.5	74.76
		Nour	-0.3304	0.3307	-	-
P2	0.4	MLE	-0.0005	0.0152	95	60.38
		Nour	-0.1664	0.1673	-	-
	0.8	MLE	0.0003	0.0192	96.5	75.81
		Nour	-0.3367	0.3370	-	-
P3	0.4	MLE	0.0005	0.0097	94	38.60
		Nour	-0.0879	0.0891	-	-
	0.8	MLE	-0.0008	0.0123	95	48.36
		Nour	-0.1741	0.1745	-	-
P4	0.4	MLE	0.0005	0.0092	94.5	36.81
		Nour	-0.0809	0.0819	-	-
	0.8	MLE	0.0009	0.0121	94.5	46.92
		Nour	-0.1689	0.1693	-	-
P5	0.4	MLE	0.0023	0.0184	95	71.12
		Nour	-0.2012	0.2020	-	-
	0.8	MLE	0.0004	0.0231	95	91.65
		Nour	-0.4165	0.4168	-	-
P6	0.4	MLE	0.0023	0.0196	96.5	77.21
		Nour	-0.2198	0.2207	-	-
	0.8	MLE	0.0003	0.0240	96	95.88
		Nour	-0.4289	0.4293	-	-

Table 6: Summary results on the estimators of $N_{A}$ under the simulation model Model I with $N_{A}=1200$ when the ratio of the sub-population sizes ($r=1.2$) is known.

Population	$\alpha_{A}$	Method	RB	RRMSE	CP (%)	LCI
P1	0.4	MLE	-0.0010	0.0111	100	149.35
		Wolter-2	-0.0013	0.0131	54	23.31
	0.8	MLE	-0.0008	0.0121	100	206.59
		Wolter-2	-0.0013	0.0131	21	9.16
P2	0.4	MLE	0.0014	0.0151	100	152.80
		Wolter-2	0.0007	0.0166	55	30.81
	0.8	MLE	0.0014	0.0157	100	209.97
		Wolter-2	0.0007	0.0166	29.5	14.95
P3	0.4	MLE	0.0013	0.0103	100	140.26
		Wolter-2	0.0005	0.0133	53.5	24.45
	0.8	MLE	0.0012	0.0090	100	157.11
		Wolter-2	0.0005	0.0133	21	10.59
P4	0.4	MLE	0.0013	0.0074	100	140.14
		Wolter-2	0.0008	0.0099	55.5	17.89
	0.8	MLE	0.0008	0.0064	100	151.99
		Wolter-2	0.0008	0.0099	22	5.88
P5	0.4	MLE	-0.0001	0.0101	100	161.52
		Wolter-2	0.0001	0.0184	52	33.49
	0.8	MLE	0.0007	0.0167	100	248.26
		Wolter-2	0.0001	0.0184	27	17.07
P6	0.4	MLE	0.0013	0.0116	100	361.61
		Wolter-2	-0.0005	0.0248	54	45.24
	0.8	MLE	0.0006	0.0115	100	260.15
		Wolter-2	-0.0005	0.0248	31	25.62

Table 7: Summary results on the estimators of $N_{A}$ and $N_{B}$ under the simulation model Model II with $(N_{A},N_{B})=(1200,1000)$ when the ratio of the sub-population sizes ($r=1.2$) is known.

Population	$\alpha_{0}$	Method	RB	RRMSE	CP (%)	LCI
Results on estimators of $N_{A}$
P1	0.4	MLE	0.0003	0.0061	97	5.93
		Wolter-1	-0.0680	0.3516	53.5	1379.14
	0.8	MLE	0.0001	0.0003	90	1.80
		Wolter-1	-0.2125	0.6351	25.5	465.28
P2	0.4	MLE	$< 10^{- 4}$	0.0003	96.5	4.51
		Wolter-1	-0.0216	0.5508	71.5	1290.38
	0.8	MLE	$< 10^{- 4}$	0.0015	95	3.78
		Wolter-1	-0.0881	2.2801	34	580.83
P3	0.4	MLE	0.0002	0.0033	98	12.05
		Wolter-1	0.2069	3.4894	81.5	923.23
	0.8	MLE	0.0004	0.0052	98	15.87
		Wolter-1	-0.1149	0.1825	54	397.55
P4	0.4	MLE	0.0005	0.0068	99	14.82
		Wolter-1	-0.0103	0.3202	83	927.42
	0.8	MLE	$< 10^{- 4}$	0.0003	97.5	17.12
		Wolter-1	-0.0774	0.2726	52.5	420.36
P5	0.4	MLE	0.0002	0.0032	96.5	1.61
		Wolter-1	-0.1223	0.3675	53	1643.76
	0.8	MLE	$< 10^{- 4}$	$< 10^{- 4}$	99.5	1.28
		Wolter-1	-0.2561	0.8123	29.5	508.70
P6	0.4	MLE	$< 10^{- 4}$	0.0003	99	0.91
		Wolter-1	-0.1910	0.4858	69	1431.90
	0.8	MLE	$< 10^{- 4}$	$< 10^{- 4}$	100	3.80
		Wolter-1	-0.2666	0.7618	27.5	612.75
Results on estimators of $N_{B}$
P1	0.4	MLE	0.0002	0.0061	81.5	4.78
		Wolter-1	-0.0688	0.3521	54.5	1379.15
	0.8	MLE	-0.0001	0.0003	89.5	1.56
		Wolter-1	-0.2163	0.6371	28	465.28
P2	0.4	MLE	$> - 10^{- 4}$	0.0004	95.5	3.79
		Wolter-1	-0.0247	0.5520	67	1290.38
	0.8	MLE	$< 10^{- 4}$	0.0015	95	3.25
		Wolter-1	-0.0918	2.2806	31	580.83
P3	0.4	MLE	0.0002	0.0033	98	10.05
		Wolter-1	0.2027	3.4896	82.5	923.23
	0.8	MLE	0.0004	0.0052	98	13.24
		Wolter-1	-0.1194	0.1868	52	397.55
P4	0.4	MLE	0.0005	0.0067	99	12.33
		Wolter-1	-0.0127	0.3211	85	927.42
	0.8	MLE	$< 10^{- 4}$	0.0002	98.5	14.25
		Wolter-1	-0.0816	0.2753	51.5	420.36
P5	0.4	MLE	0.0002	0.0032	97	1.38
		Wolter-1	-0.1243	0.3689	52	1643.76
	0.8	MLE	$< 10^{- 4}$	$< 10^{- 4}$	99.5	1.12
		Wolter-1	-0.2611	0.8149	27	508.70
P6	0.4	MLE	$< 10^{- 4}$	0.0002	99	0.78
		Wolter-1	-0.1982	0.4904	63.5	1431.90
	0.8	MLE	$< 10^{- 4}$	$< 10^{- 4}$	100	3.20
		Wolter-1	-0.2722	0.7651	26	612.75

Table 8: Data sets used to illustrate the proposed methods.

Dataset	Stratum	$x_{11}$	$x_{10}$	$x_{01}$	Total
Encephalitis	Adult	39	290	39	368
	Children	20	78	15	113
Children Death	Male	30	153	8	191
	Female	15	173	7	195

Table 9: Summary results of real data analysis with the proposed Models I and II.

Dataset	Stratum	Quantity	Model I (MLE)	Model II (MLE)	LP ($\hat{N}^{(LP)}$)
Encephalitis	Adult	$\hat{N}_{A}$ [RSE]	660 [0.077]	739 [0.012]	658 [0.212]
		C.I.	(563, 760)	(731, 748)	(463, 988)
		$\hat{\alpha}_{A}$	0.052	0.031	-
	Children	$\hat{N}_{B}$ [RSE]	197 [0.104]	213 [0.072]	171 [0.317]
		C.I.	(160, 241)	(160, 241)	(101, 314)
Children Death	Male	$\hat{N}_{A}$ [RSE]	268 [0.054]	250 [0.092]	231 [0.244]
		C.I.	(244, 303)	(204, 302)	(151, 362)
		$\hat{\alpha}_{A}$	0.070	0.006	-
	Female	$\hat{N}_{B}$ [RSE]	276 [0.052]	262 [0.097]	275 [0.424]
		C.I.	(250, 306)	(212, 324)	(145, 552)

Equations

\hat{N}^{(LP)} = \frac{x_{1\cdot} x_{\cdot 1}}{x_{11}},

\hat{N}^{(Nour)} = x_{0} + \frac{2 x_{11} x_{10} x_{01}}{x_{11}^{2} + x_{10} x_{01}},

\hat{N}_{B}^{(W1)} = \max\left(\frac{Q x_{0B} - x_{0A}}{Q - r}, x_{0B}\right), \quad \hat{N}_{A}^{(W1)} = \max\left(r \hat{N}_{B}^{(W1)}, x_{0A}\right),

\hat{N}_{B}^{(W2)} = \frac{x_{1\cdot B} x_{\cdot 1B}}{x_{11B}}, \quad \hat{N}_{A}^{(W2)} = \max\left(r \hat{N}_{B}^{(W2)}, x_{0A}\right).

(Y_{h}, Z_{h}) = \begin{cases} (X_{1h}^{*}, X_{2h}^{*}) & \text{with prob. } 1-\alpha, \\ (X_{1h}^{*}, X_{1h}^{*}) & \text{with prob. } \alpha, \end{cases}

p_{11} = \alpha p_{1} + (1-\alpha) p_{1} p_{2},

p_{01} = (1-\alpha)(1-p_{1}) p_{2},

p_{Y} = p_{1\cdot} = p_{1},

(Y_{h}, Z_{h}) = \begin{cases} (X_{1h}^{*}, X_{2h}^{*}) & \text{with prob. } 1-\alpha, \\ (X_{1h}^{*}, 1-X_{1h}^{*}) & \text{with prob. } \alpha. \end{cases}

N_{A} p_{11A} = x_{11A},

\hat{N}_{A}^{(1)}

\hat{p}_{2A}^{(1)}

\hat{\alpha}_{A}^{(1)}

L(\theta_{1} \mid \underline{x}_{A}, \underline{x}_{B})

\hat{p}_{2A}^{(2)}

\hat{p}_{2B}^{(2)}

\hat{\alpha}_{0}^{(2)}

\hat{p}_{1}^{(2)}

\hat{N}_{A}^{(2)}

\hat{N}_{B}^{(2)}

L(\theta_{2} \mid \underline{x}_{A}, \underline{x}_{B})

RB

RRMSE

[x_{0A} + (\hat{N}_{A} - x_{0A})/C, \; x_{0A} + (\hat{N}_{A} - x_{0A}) C],

N_{A} \alpha_{A} \hat{p}_{1} + (1-\alpha_{A}) N_{A} \hat{p}_{1} p_{2A}

N_{A} \hat{p}_{1} (1-p_{2A})(1-\alpha_{A})

N_{A} p_{2A} (1-\hat{p}_{1})(1-\alpha_{A})

\hat{N}_{A}

N_{A} \hat{p}_{1} + N_{A} p_{2A} (1-\alpha_{A})(1-\hat{p}_{1})

N_{A} (1-\alpha_{A})(\hat{p}_{1} - p_{2A})

p_{2A}(1-\alpha_{A})

\alpha_{A} + p_{2A}(1-\alpha_{A})

\hat{\alpha}_{A}

\hat{p}_{2A}

\hat{\alpha}_{A} = \min\left\{\max\left\{0, \frac{x_{\cdot 1A}}{x_{1\cdot A}} - \frac{x_{01A} x_{\cdot 1B}}{x_{01B} x_{1\cdot A}}\right\}, 1\right\}.

N_{A} p_{1A} (1-p_{2A})(1-\alpha_{A})

Full text

Abstract

Population size estimation based on two sample capture-recapture type experiments is an interesting problem in various fields including epidemiology, public health, population studies, etc. The Lincoln-Petersen estimate is popularly used under the assumption that the capture and recapture statuses of each individual are independent. However, in many real life scenarios, there is an inherent dependency between capture and recapture attempts which is not well studied in the literature on the dual system or two sample capture-recapture method. In this article, we propose a novel model that successfully incorporates the possible causal dependency and provide corresponding estimation methodologies for the associated model parameters based on post-stratified two sample capture-recapture data. The superiority of the performance of the proposed model over the existing competitors is established through an extensive simulation study. The method is illustrated through analysis of some real data sets.

Kiranmoy Chatterjee

Interdisciplinary Statistical Research Unit, Indian Statistical Institute

Prajamitra Bhuyan

Department of Mathematics, Imperial College London

Keywords : Behavioural dependency, Bivariate Bernoulli, Disease surveillance, Method of moments, Maximum likelihood, Post-stratification.

1 Introduction

Estimation of the size of a population is a problem of interest in epidemiological, medical, social and demographic studies. In order to formulate policies on public health issues, federal agencies generally want to know the actual size of a diseased population (e.g. Encephalitis patients) or the number of vital events (e.g. child mortality) in a specified region. Any attempt to count all the individuals belonging to a population of interest is subject to error, and the degree of error depends on many factors, such as the population size and the individuals' capture probabilities. In this context, two sources of information are widely used for human populations, since more than two sources are rarely available in demographic studies because of practical constraints such as survey cost and human mobility (Chatterjee and Mukherjee, 2016b). In order to draw inference from two capture attempts, one needs to combine the data obtained from the two surveys and determine how many people are included in both lists and how many are included in exactly one of the lists. This yields an incomplete $2\times 2$ cross-classified data structure, known as the dual-record system (DRS), which is similar to two sample capture-recapture data (Wolter, 1986; Chatterjee and Mukherjee, 2016a). In DRS, counts for three cells are available, but the fourth cell count is unknown, which makes the true population size, say $N$, unknown. The primary goal is to estimate the missing cell count, or equivalently $N$, from the available data. This is close to the capture-recapture experiment widely practiced in wildlife studies, with only one recapture attempt. Often, the survey mechanism allows post-stratification of the entire population into mutually exclusive and exhaustive sub-populations based on demographic and social characteristics (e.g. age, sex, ethnicity, etc.), and it is also of great interest to estimate the sub-population sizes (Bell, 1993; Wolter, 1990).

In order to estimate $N$, a common practice is to assume causal independence between capture and recapture attempts; the resulting estimator is popularly known as the Lincoln-Petersen (LP) estimator in DRS (Otis et al., 1978; Bohning and Heijden, 2009). With an additional assumption of time-variation, Otis et al. (1978) proposed the model $M_{t}$, and the resulting estimator is the same as the LP estimator. Chatterjee and Mukherjee (2016a) proposed an integrated likelihood estimation methodology based on the model $M_{t}$ and compared its performance with other likelihood-based estimators in DRS. However, model $M_{t}$ (equivalently, the LP estimator) often fails because of positive dependence between the two lists, especially in the fields of public health and demography, which leads to underestimation of $N$ (Hook and Regal, 1982; Chao et al., 2001). For example, patients with a positive result from a serum test for Hepatitis A Virus (HAV) are prone to visit a hospital for further treatment. Therefore, ascertainment by the serum sample and ascertainment by the hospital sample become dependent. In census-undercount studies, Fay et al. (1988) and Bell (1993) observed such dependence in behavioral response among adult males, but not among females, in the Post Enumeration Programs conducted to evaluate the US Censuses of 1980 and 1990, respectively. In epidemiological or demographic surveillance with two sample capture-recapture experiments, positive list-dependence is often observed (Chatterjee and Mukherjee, 2016b; Schrauder and Hellenbrand, 2007). Similarly, negative dependence is encountered in some populations, such as child injury data collected by hospitals and police stations, drug-abusing populations, and populations of patients affected by HIV or other diseases that bear social stigma (Chatterjee and Mukherjee, 2016b; Chatterjee and Mukherjee, 2018). Recently, Yang and Pal (2010) proposed an empirical Bayes estimator which performs better than the LP estimator as well as some of its modified versions, including Chapman's and Bailey's estimators. However, their underlying hypergeometric model does not account for list-dependence. In this context, model $M_{tb}$, proposed by Otis et al. (1978), explicitly includes list-dependence through a behavioral response effect parameter, but this model is not estimable in DRS (Chao et al., 2000; Chatterjee and Mukherjee, 2016b). Yang and Chao (2005) proposed a Markov chain approach that incorporates both long-term and short-term behavioral response effects in the existing models for capture-recapture experiments. However, their model is also not estimable in DRS (Chatterjee, 2015).

Modeling capture-recapture data under causal dependence is an important but challenging task in DRS. Nour (1982) proposed an estimate of the total number of vital records assuming positive dependence between the two lists in a DRS of vital events registration. Wolter (1990) provided estimates of post-stratum sub-population sizes (e.g. male, female) under two different models, assuming the ratio of the sub-population sizes (i.e. the sex-ratio) to be known from Demographic Analysis. In the first model, Wolter (1990) assumed that the cross-product ratios in the DRSs for the male and female post-strata are the same but unknown; in the second, causal independence is assumed for females only. Isaki and Schultz (1986) also worked on the same problem for the 1980 Post Enumeration Program and suggested an estimate based on demographic analysis. Later, Bell (1993) proposed some variations of the methods suggested by Wolter (1990) for estimating the cross-product ratios for both male and female populations. However, the ratio of the sub-population sizes (e.g. sex-ratio) is calculated at the time of the census for a larger population (e.g. the national population). In many situations, it is not realistic to assume that the ratio remains constant over time or holds for the sub-populations under consideration. Moreover, this ratio is rarely available for the population of interest in the various fields where the DRS type data structure is commonly used (e.g. epidemiological or disease surveillance data; see Section 6).

In this article, we propose a novel model that incorporates this inherent dependency between capture and recapture attempts in DRS without requiring knowledge of the ratio of the sub-population sizes, and we provide estimation methodologies for the population size $N$ based on post-stratification under two different scenarios. Our model can also incorporate available information on the ratio of the sub-population sizes, in which case it provides better results than the existing competitor. Our work is motivated by two real datasets on public health: (i) Encephalitis incidence in England, 2006-2007 and (ii) child mortality in western Kenya, 2000-2001, where the existing methods proposed by Wolter (1990) and Nour (1982) are not applicable. The parameters of our model are easy to interpret, and the associated estimates are superior to the existing competitors in the literature with respect to relative bias, relative root mean squared error and coverage probability (see Sections 4 and 5). We first describe the DRS and the associated data structure in Section 2. In Section 3, we propose a Bivariate Bernoulli model under DRS. Next, in Section 4, we derive method of moments estimates and discuss maximum likelihood estimation of the model parameters. The proposed estimators are compared with their existing competitors through extensive simulation and two illustrative data analyses in Sections 5 and 6, respectively. Finally, we end with some concluding remarks in Section 7.

2 Dual-record System (DRS)

As discussed in Section 1, DRS is similar to two sample capture-recapture sampling, which is very common in estimating the size of human populations. Let us consider a population $U$ of size $N$. The individuals captured in the first list (e.g. census) are matched one-by-one with the individuals captured in the second list (e.g. Post Enumeration Program). Let $p_{j1\cdot}$ and $p_{j\cdot 1}$ denote the capture probabilities of the $j$-th individual in the first sample (List 1) and the second sample (List 2), respectively. Under this set-up, we consider the following assumptions:

( $S1$ ) population is closed until the second sample is taken,

( $S2$ ) individuals are homogeneous with respect to their capture probabilities in each of the two attempts.

Assumption ($S2$) ensures that $p_{j1\cdot}=p_{1\cdot}$ in List 1 and $p_{j\cdot 1}=p_{\cdot 1}$ in List 2 for $j=1,2,\ldots,N$. The data structure presented in Table 1 is popularly known as the dual-record system or, in short, DRS. The number of individuals missed by both surveys, denoted by $x_{00}$, is unknown, which makes the total population size $N$ unknown. The probabilities attached to the cells are also given in Table 1, and this notation is followed throughout the paper. As discussed before, causal independence is commonly assumed between capture and recapture attempts, which is formally written as:

$(S3)$ inclusion of each individual belonging to $U$ in List 2 is causally independent of its inclusion in List 1 (i.e. $p_{11}=p_{1\cdot}p_{\cdot 1}$).

Under $(S3)$, the estimate of $N$ is

$$\hat{N}^{(LP)}=\frac{x_{1\cdot}\,x_{\cdot 1}}{x_{11}},$$

which is popularly known as the Lincoln-Petersen (LP) estimator. This estimator is identical to the conditional likelihood estimator of $N$ from the model $M_{t}$ (Wolter, 1986), and it is traditionally used in several fields including public health, economics and demography (Bohning and Heijden, 2009). However, this model has been seriously criticized for its underlying causal independence assumption $(S3)$ in the context of human populations (ChandraSekar and Deming, 1949; Chao et al., 2001). In many situations, failure to capture an individual in both attempts may be due to some common causes, which leads to a positive association between the two lists. In other cases, individuals may be less keen to be enlisted in List 2, which results in a negative association between the lists. These phenomena are broadly known as behavioral response variation (see Wolter (1986) for more details).
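As a minimal sketch of the Lincoln-Petersen calculation described above, the function below computes the estimate from the three observed DRS cells. The function name `lincoln_petersen` is illustrative and not from the paper; the input counts are the adult Encephalitis stratum listed in Table 8.

```python
def lincoln_petersen(x11: int, x10: int, x01: int) -> float:
    """Lincoln-Petersen estimate of N from the three observed DRS cells.

    x11: individuals in both lists
    x10: individuals in List 1 only
    x01: individuals in List 2 only
    """
    if x11 <= 0:
        raise ValueError("x11 must be positive: the estimate needs matched individuals")
    x1_dot = x11 + x10  # List 1 total
    x_dot1 = x11 + x01  # List 2 total
    return x1_dot * x_dot1 / x11


# Adult Encephalitis stratum from Table 8: x11 = 39, x10 = 290, x01 = 39.
n_hat = lincoln_petersen(39, 290, 39)
x00_hat = n_hat - (39 + 290 + 39)  # implied estimate of the unobserved cell
print(n_hat, x00_hat)  # → 658.0 290.0
```

The value 658 agrees with the LP entry for the adult stratum in Table 9.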

In the context of demographic studies, Nour (1982) considered possible positive association between the two lists in DRS. Assuming that both marginal capture probabilities (i.e. $p_{1\cdot}$ and $p_{\cdot 1}$) are greater than 0.5, Nour (1982) derived the estimate of $N$ as

$$\hat{N}^{(Nour)}=x_{0}+\frac{2\,x_{11}x_{10}x_{01}}{x_{11}^{2}+x_{10}x_{01}},$$

where $x_{0}=x_{11}+x_{10}+x_{01}$.
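The arithmetic of Nour's estimator, $x_{0}+2x_{11}x_{10}x_{01}/(x_{11}^{2}+x_{10}x_{01})$, can be sketched as follows. The function name `nour_estimate` and the cell counts are hypothetical, chosen only to illustrate the formula; they are not data from the paper.

```python
def nour_estimate(x11: int, x10: int, x01: int) -> float:
    """Nour (1982) estimate of N: x0 plus an estimate of the missed cell x00."""
    x0 = x11 + x10 + x01            # individuals seen in at least one list
    denom = x11 ** 2 + x10 * x01
    if denom == 0:
        raise ValueError("x11^2 + x10*x01 must be positive")
    return x0 + 2 * x11 * x10 * x01 / denom


# Hypothetical counts (not from the paper), with most captures matched
# across the two lists.
x11, x10, x01 = 60, 20, 15
n_nour = nour_estimate(x11, x10, x01)
print(round(n_nour, 2))
```

The second term is the estimate of the unobserved count $x_{00}$, so the result is never below the number of distinct individuals actually observed.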

As mentioned in the previous section, Wolter (1990) considered post-stratification of the entire population $U$ into two mutually exclusive and exhaustive sub-populations, say $U_{A}$ and $U_{B}$ (e.g. male and female), of sizes $N_{A}$ and $N_{B}$ such that $N_{A}+N_{B}=N$. Therefore, the observed data, structured as in Table 1, are divided into ($x_{11A},x_{10A},x_{01A}$) and ($x_{11B},x_{10B},x_{01B}$) for the two sub-populations $U_{A}$ and $U_{B}$, respectively. Based on these data, Wolter (1990) proposed two models with the common assumption that the ratio of the sub-population sizes, $r=N_{A}/N_{B}$, is known. In the first model, say Wolter-1, the cross-product ratios for $U_{A}$ (say, $\theta_{A}=\frac{p_{11A}p_{00A}}{p_{10A}p_{01A}}$) and $U_{B}$ (say, $\theta_{B}=\frac{p_{11B}p_{00B}}{p_{10B}p_{01B}}$) are assumed to be the same but unknown, i.e. $\theta_{A}=\theta_{B}$. The estimates of the sub-population sizes from Wolter-1 are given by

$$\hat{N}_{B}^{(W1)}=\max\left(\frac{Q\,x_{0B}-x_{0A}}{Q-r},\;x_{0B}\right),\qquad \hat{N}_{A}^{(W1)}=\max\left(r\,\hat{N}_{B}^{(W1)},\;x_{0A}\right),$$

where $Q=\left(x_{11B}x_{10A}x_{01A}\right)/\left(x_{11A}x_{10B}x_{01B}\right)$, and $x_{0k}=x_{11k}+x_{10k}+x_{01k}$ is the total number of captured individuals from the sub-population $U_{k}$, for $k=A,B$. In the second model, say Wolter-2, Wolter (1990) additionally assumed that causal independence holds only for the sub-population $U_{B}$, and the resulting estimates are given by

$$\hat{N}_{B}^{(W2)}=\frac{x_{1\cdot B}\,x_{\cdot 1B}}{x_{11B}},\qquad \hat{N}_{A}^{(W2)}=\max\left(r\,\hat{N}_{B}^{(W2)},\;x_{0A}\right).$$

See Wolter (1990) for more details.
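The two Wolter estimators above can be sketched in a few lines. This is an illustrative implementation of the stated formulas only: the function names, the cell counts and the value $r=1.2$ used in the example are assumptions made here, not data or code from the paper.

```python
from typing import Tuple

Cells = Tuple[int, int, int]  # (x11, x10, x01) for one sub-population


def wolter_1(cells_a: Cells, cells_b: Cells, r: float) -> Tuple[float, float]:
    """Wolter-1: equal cross-product ratios in U_A and U_B, known r = N_A / N_B."""
    x11a, x10a, x01a = cells_a
    x11b, x10b, x01b = cells_b
    x0a, x0b = sum(cells_a), sum(cells_b)
    q = (x11b * x10a * x01a) / (x11a * x10b * x01b)
    if q == r:
        raise ValueError("Q equals r: the Wolter-1 estimate is undefined")
    n_b = max((q * x0b - x0a) / (q - r), x0b)
    n_a = max(r * n_b, x0a)
    return n_a, n_b


def wolter_2(cells_a: Cells, cells_b: Cells, r: float) -> Tuple[float, float]:
    """Wolter-2: causal independence in U_B only, known r = N_A / N_B."""
    x11b, x10b, x01b = cells_b
    n_b = (x11b + x10b) * (x11b + x01b) / x11b  # Lincoln-Petersen in U_B
    n_a = max(r * n_b, sum(cells_a))
    return n_a, n_b


# Hypothetical counts and ratio, for illustration only.
cells_a, cells_b, r = (30, 40, 40), (50, 25, 25), 1.2
print(wolter_1(cells_a, cells_b, r))
print(wolter_2(cells_a, cells_b, r))
```

When neither maximum binds, the Wolter-1 estimates imply the same estimated cross-product ratio $x_{11k}\hat{x}_{00k}/(x_{10k}x_{01k})$ in both sub-populations, which is a convenient sanity check on an implementation.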

3 Proposed Model

In this section, we first introduce a Bivariate Bernoulli model (BBM), which is useful for measuring the degree of association between capture and recapture attempts. Although the problem can be generalized to a multivariate setup for the multiple-list problem, in the present paper we focus on the bivariate version for DRS only.

In any given population, some individuals are expected to behave independently over the two capture attempts in DRS, while dependence in the behavioral responses may exist for the rest of the population. Let $\alpha$ be the proportion of individuals for whom behavioral dependence between List 1 and List 2 exists. To capture this dependency structure, we define a pair ($X_{1h}^{*},X_{2h}^{*}$), which represents the latent capture statuses of the $h$-th individual in the first and second attempts, respectively, for $h=1,2,\ldots,N$. The latent capture status $X_{lh}^{*}$ takes the value 1 or 0, denoting the presence or absence of the $h$-th individual in the $l$-th list, for $l=1,2$. Under this setup, for a proportion $\alpha$ of individuals, the List 2 status is the same as the latent List 1 status $X_{1h}^{*}$. Now, let us define $Y_{h}$ and $Z_{h}$, respectively, as the List 1 and List 2 inclusion statuses of the $h$-th individual belonging to $U$, for $h=1,2,\ldots,N$. Note that $(Y_{h},Z_{h})$ is the manifestation of the latent capture statuses ($X_{1h}^{*},X_{2h}^{*}$) for the $h$-th individual. Therefore, we can formally write the interdependence between the two lists as

$$(Y_{h},Z_{h})=\begin{cases}(X_{1h}^{*},X_{2h}^{*}) & \text{with prob. } 1-\alpha,\\ (X_{1h}^{*},X_{1h}^{*}) & \text{with prob. } \alpha,\end{cases}\tag{5}$$

where $X_{1h}^{*}$ s and $X_{2h}^{*}$ s are independently and identically distributed Bernoulli random variables with parameters $p_{1}$ and $p_{2}$ , respectively. Note that $p_{l}$ refers to the capture probability of a causally independent individual in the l-th list. We call this model, given in equation (5), as Bivariate Bernoulli model in DRS (BBM-DRS). Now, we denote $Prob(Y=y,Z=z)$ by $p_{yz}$ , for $y,z=\{0,1\}$ . Thus, based on the parameters involved in the above model, presented in equation (5), the cell probabilities associated with DRS (See Table 1) are given by:

[TABLE]

The corresponding marginal probabilities are given by

[TABLE]

with $Cov(Y,Z)=\alpha p_{1}(1-p_{1})$. Since this covariance is non-negative for $\alpha\in[0,1]$, the proposed Bivariate Bernoulli model incorporates positive dependence between the capture statuses in Lists 1 and 2. In particular, when $\alpha=0$ (i.e. there is no causal dependency), our proposed Bivariate Bernoulli model in (5) reduces to the $M_{t}$ model.
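The cell probabilities and the covariance stated above can be checked numerically. The following is a minimal sketch, assuming the mixture reading of the model described in the text: $Y=X_{1}^{*}$ always, while $Z=X_{1}^{*}$ with probability $\alpha$ and $Z=X_{2}^{*}$ otherwise. The function names `bbm_drs_cell_probs` and `covariance_yz` are illustrative and are not taken from the paper.

```python
def bbm_drs_cell_probs(alpha, p1, p2):
    """Cell probabilities of (Y, Z) under the assumed BBM-DRS mixture.

    Assumption: Y = X1*, and Z = X1* with probability alpha,
    Z = X2* with probability 1 - alpha, with X1* and X2* independent.
    """
    for name, value in (("alpha", alpha), ("p1", p1), ("p2", p2)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return {
        "p11": alpha * p1 + (1 - alpha) * p1 * p2,
        "p10": (1 - alpha) * p1 * (1 - p2),
        "p01": (1 - alpha) * (1 - p1) * p2,
        "p00": alpha * (1 - p1) + (1 - alpha) * (1 - p1) * (1 - p2),
    }


def covariance_yz(probs):
    """Cov(Y, Z) = p11 - p1. * p.1, computed from the cell probabilities."""
    p1_dot = probs["p11"] + probs["p10"]  # List 1 marginal
    p_dot1 = probs["p11"] + probs["p01"]  # List 2 marginal
    return probs["p11"] - p1_dot * p_dot1


probs = bbm_drs_cell_probs(alpha=0.4, p1=0.6, p2=0.7)
print(probs, covariance_yz(probs))
```

Under this reading the List 1 marginal equals $p_{1}$, and setting `alpha=0` gives $p_{11}=p_{1}p_{2}$, which is the independence structure of the $M_{t}$ model.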

Remark 1.

One can define the proposed BBM-DRS in order to capture negative dependency (or, recapture aversion) by rewriting (5) as

[TABLE]

Remark 2.

The parameters of BBM-DRS possess easy interpretations with practical significance. The dependence parameter $\alpha$ represents the proportion of behaviorally dependent individuals, and $p_{l}$ is the capture probability of a causally independent individual in the l-th list, for $l=1,2$.
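The interpretation of $\alpha$ can be made concrete by simulating dual-record data. The sketch below is an illustration only, under the same assumed mixture reading of the model ($Y=X_{1}^{*}$; $Z=X_{1}^{*}$ for a dependent individual and $Z=X_{2}^{*}$ otherwise). The function `simulate_drs` and its arguments are hypothetical names introduced here.

```python
import random


def simulate_drs(n, alpha, p1, p2, seed=None):
    """Simulate 2x2 dual-record counts for n individuals (assumed BBM-DRS)."""
    rng = random.Random(seed)
    counts = {"x11": 0, "x10": 0, "x01": 0, "x00": 0}
    for _ in range(n):
        x1 = rng.random() < p1            # latent List 1 status
        x2 = rng.random() < p2            # latent List 2 status
        dependent = rng.random() < alpha  # behaviorally dependent individual?
        y = x1
        z = x1 if dependent else x2       # dependent individuals repeat List 1
        counts[f"x{int(y)}{int(z)}"] += 1
    # In a real DRS the x00 cell (missed by both lists) is not observed.
    return counts


print(simulate_drs(n=1200, alpha=0.4, p1=0.6, p2=0.7, seed=1))
```

With `alpha=1` every individual is dependent, so the discordant cells `x10` and `x01` are empty by construction.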

4 Estimation Methodologies

In practice, one can easily consider post-stratification of the entire population into two mutually exclusive and exhaustive sub-populations $U_{A}$ and $U_{B}$, as discussed in Section 2 (see Wolter (1990), Eisele et al. (2003) and Granerod et al. (2013)). We also assume that, for any individual belonging to $U_{A}$, the capture status in either of the two lists is independent of that of an individual belonging to $U_{B}$. In order to denote the cell counts and the associated probabilities for the $2\times 2$ table obtained under the DRS for the sub-population $U_{k}$, we consider the same notation as in Table 1, with an additional suffix $k$ (for example, the List 1 capture probability for the sub-population $U_{k}$ is denoted as $p_{1\cdot k}$), for $k=A,B$. Now we consider two different models and propose methodologies for estimating the associated parameters, including the population size $N$ $(=N_{A}+N_{B})$, the parameter of primary interest.

4.1 Model I

In this model, we consider the assumption $(S3)$ for the sub-population $U_{B}$, which implies $p_{11B}=p_{1\cdot B}p_{\cdot 1B}$. Therefore, the popular Lincoln-Petersen estimate of $N_{B}$ is given as $\hat{N}_{B}=\textstyle\left(\frac{x_{1\cdot B}x_{\cdot 1B}}{x_{11B}}\right)$. In order to incorporate the behavioral dependency present in the sub-population $U_{A}$, we consider BBM-DRS as described in Section 3, which consists of four parameters with $p_{1}=p_{1A},p_{2}=p_{2A},\alpha=\alpha_{A}$, and $N=N_{A}$. In addition to $(S3)$, we consider the following assumption:

$(S4)$ Initial (List 1) capture probabilities for the individuals belonging to both the sub-populations $U_{A}$ and $U_{B}$ are the same (i.e. $p_{1\cdot A}=p_{1\cdot B}=p_{1},say$ ).

The assumption $(S4)$ ensures estimability of the model parameters. Note that List 1 is prepared before List 2 and hence, List 2 capture probabilities for different sub-populations may differ due to behavioral dependence, if it exists. Also, it is quite reasonable to consider the same List 1 capture probability for different sub-populations when there is possibly no prejudice. A similar assumption has been considered by several authors in the past (Bell, 1993). Under a similar setup, Wolter (1990) proposed an estimate of $N_{B}$ based on the $M_{t}$ model and an estimate of $N_{A}$ using the available knowledge on the ratio of the sub-population sizes (e.g. sex-ratio). As discussed before, the availability of a reliable estimate of this ratio remains a practical challenge (see Section 6). As mentioned before, $N_{B}$ is estimated assuming causal independence, and hence, one needs to find the estimate of $N_{A}$ in order to estimate the population size $N$. Since $\alpha_{A}$ can be interpreted as the proportion of behaviorally dependent individuals, its estimation may provide interesting insight into the capture-recapture mechanism.

First we consider method of moments estimation of the parameters associated with the proposed Model I. Note that the method of moments estimate (MME) of $N_{B}$ is same as the Lincoln-Petersen estimate $\hat{N}_{B}=\textstyle\left(\frac{x_{1\cdot B}x_{\cdot 1B}}{x_{11B}}\right)$ , and the MMEs of $p_{1B}$ and $p_{2B}$ are given as $\hat{p}_{1B}=\frac{x_{11B}}{x_{\cdot 1B}}$ and $\hat{p}_{2B}=\frac{x_{11B}}{x_{1\cdot B}}$ , respectively. Using the assumption $(S4)$ , the estimate of $p_{1A}$ is given by $\hat{p}_{1A}=\hat{p}_{1}=\frac{x_{11B}}{x_{\cdot 1B}}$ . Now, equating the expected and observed number of cell counts in the $2\times 2$ table obtained under the DRS (Table 1) for the sub-population $U_{A}$ , we get

[TABLE]

which involve three unknown parameters $N_{A},p_{2A}$ , and $\alpha_{A}$ . Solving these equations in (6), the MMEs of the model parameters are obtained as

[TABLE]

The detailed derivation of the above-mentioned MMEs is provided in the Appendix.
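The moment equations for Model I can be solved in a few lines. The closed-form expressions of the paper are not reproduced in this extract, so the following is only a sketch: it assumes the cell probabilities $p_{10A}=(1-\alpha_{A})p_{1}(1-p_{2A})$ and $p_{01A}=(1-\alpha_{A})(1-p_{1})p_{2A}$ implied by the mixture reading of BBM-DRS, uses $N_{A}\hat{p}_{1}=x_{1\cdot A}$ as in the Appendix, and truncates $\hat{\alpha}_{A}$ to $[0,1]$. The function names are hypothetical.

```python
def lincoln_petersen(x11, x10, x01):
    """Lincoln-Petersen estimate x1. * x.1 / x11 for one stratum."""
    if x11 <= 0:
        raise ValueError("x11 must be positive")
    return (x11 + x10) * (x11 + x01) / x11


def mme_model1(xa, xb):
    """Method-of-moments sketch for Model I.

    xa, xb: observed counts (x11, x10, x01) for U_A and U_B.
    Assumes (S3) for U_B, (S4), and the assumed BBM-DRS cell probabilities.
    """
    x11a, x10a, x01a = xa
    x11b, x10b, x01b = xb
    p1_hat = x11b / (x11b + x01b)        # List 1 capture probability from U_B
    n_b_hat = lincoln_petersen(x11b, x10b, x01b)
    x1_dot_a = x11a + x10a
    n_a_hat = x1_dot_a / p1_hat          # from N_A * p1 = x1.A
    if n_a_hat <= x1_dot_a:
        raise ValueError("p1_hat = 1: the remaining parameters are not estimable")
    a = x10a / x1_dot_a                  # estimates (1 - alpha_A) * (1 - p2A)
    b = x01a / (n_a_hat - x1_dot_a)      # estimates (1 - alpha_A) * p2A
    alpha_hat = min(1.0, max(0.0, 1.0 - a - b))   # truncated to [0, 1]
    p2a_hat = b / (a + b) if a + b > 0 else float("nan")
    return {"N_A": n_a_hat, "N_B": n_b_hat, "p1": p1_hat,
            "p2A": p2a_hat, "alpha_A": alpha_hat}


# Expected counts for N_A = N_B = 1000, p1 = 0.6, p2A = 0.7, alpha_A = 0.4, p2B = 0.5
print(mme_model1(xa=(492, 108, 168), xb=(300, 300, 200)))
```

Feeding in the exact expected counts recovers the generating parameters, which is a quick consistency check on the moment equations rather than evidence about finite-sample behavior.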

A classical approach for estimating $N$ from an incomplete $2\times 2$ cross-classified data structure is based on likelihood theory, where the data (i.e. all observed cell counts in Table 1) follow a multinomial distribution with index parameter $N$ and the associated cell probabilities $\{p_{ijk}:i,j=0,1;(i,j)\neq(0,0),k=A,B\}$ (Sanathanan, 1972). Therefore, using the relations between the cell probabilities $\{p_{ijk}\}$ and $\theta_{1}=(N_{A},N_{B},\alpha_{A},p_{1},p_{2A},p_{2B})$, as provided in Section 3, the likelihood function of $\theta_{1}$ is given by

[TABLE]

where $\underline{\textbf{x}}_{k}=\left(x_{11k},x_{10k},x_{01k}\right)$, $x_{0k}=x_{11k}+x_{10k}+x_{01k}$, for $k=A,B$. However, an explicit solution for the maximum likelihood estimate (MLE) of $\theta_{1}$ is not available. The Newton-Raphson method can be used to maximize the log-likelihood in order to estimate $\theta_{1}$, treating $N_{A}$ and $N_{B}$ as continuous parameters. Alternatively, any standard software package equipped with general purpose optimization (e.g., optim in R) can be used. Note that the log-likelihood function involves $\ln(N_{A}!)$, which may create computational difficulty for large values of $N_{A}$. In order to avoid such issues, we approximate $\ln(N_{A}!)$ as $N_{A}\ln(N_{A})-N_{A}+\frac{1}{2}\ln(2\pi N_{A})$ (Wells, 1986, p. 45).
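The approximation of $\ln(N_{A}!)$ and its role in the log-likelihood can be illustrated as follows. This is a sketch under stated assumptions: the multinomial log-likelihood of one stratum is written up to terms that do not involve the parameters, the same approximation is applied to $\ln((N-x_{0})!)$ (a choice made here, not stated in the text), and the function names are hypothetical.

```python
import math


def ln_factorial_stirling(n):
    """Approximate ln(n!) by n ln n - n + 0.5 ln(2 pi n), for continuous n > 0."""
    return n * math.log(n) - n + 0.5 * math.log(2 * math.pi * n)


def stratum_loglik(n, x11, x10, x01, p11, p10, p01):
    """Multinomial log-likelihood of one stratum, up to an additive constant.

    n is treated as continuous; the constants -ln(x11!) - ln(x10!) - ln(x01!)
    are dropped because they do not depend on the parameters.
    """
    x0 = x11 + x10 + x01                 # number of distinct captured individuals
    if n <= x0:
        return float("-inf")             # n must exceed the observed total
    p00 = 1.0 - p11 - p10 - p01
    return (ln_factorial_stirling(n) - ln_factorial_stirling(n - x0)
            + x11 * math.log(p11) + x10 * math.log(p10)
            + x01 * math.log(p01) + (n - x0) * math.log(p00))


# Accuracy of the approximation against the exact ln(1000!)
print(ln_factorial_stirling(1000), math.lgamma(1001))
```

The approximation is poor when $N-x_{0}$ is close to zero, so an optimizer should be started, and if necessary bounded, away from that boundary.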

Remark 3.

The above likelihood function (7) can be simplified using Stirling's approximation $\ln(N_{A}!)\approx N_{A}\ln(N_{A})-N_{A}$ (Whittaker and Robinson, 1967, p. 138-140), which yields closed-form expressions for the MLEs. Interestingly, the MLEs of all the parameters are then exactly equal to the respective MMEs.

Remark 4.

If the ratio of the sub-population sizes (e.g. sex-ratio for male-female stratification) $r$ is known, one can easily incorporate such information in the likelihood function (7) taking $N_{B}=r^{-1}N_{A}$ .

4.2 Model II

In Model II, we relax the assumption $(S3)$ and the BBM-DRS is considered for both the sub-populations $U_{A}$ and $U_{B}$ with parameters $p_{1}=p_{1k}$, $p_{2}=p_{2k}$, $\alpha=\alpha_{k}$, and $N=N_{k}$, for $k=A,B$. Similar to Model I, we consider the assumption $(S4)$ (i.e. $p_{1A}=p_{1B}=p_{1}$, say) and additionally we assume $\alpha_{A}=\alpha_{B}=\alpha_{0}$, say, which ensures estimability of Model II. Under a similar setup, Wolter (1990) proposed estimates of $N_{A}$ and $N_{B}$ using the ratio of the sub-population sizes. As discussed before, a reliable estimate of this ratio is not available in most cases.

We first consider the method of moments for estimating the parameters associated with the Model II. We equate the expected and observed cell counts from the $2\times 2$ tables obtained under the DRS involving six parameters $N_{A},N_{B},p_{1},p_{2A},p_{2B},\alpha_{0}$ and find the following MMEs as

[TABLE]

The derivation of the above-mentioned MMEs is similar to that for Model I; see the Appendix for more details. In some cases $\hat{p}_{2A}>\frac{x_{01A}}{x_{01A}-x_{10A}}$, and hence the estimates of $p_{1}$, $N_{A}$ and $N_{B}$ become negative, as in Wolter (1990). Such issues with the MME have been discussed in the literature (see Bowman and Shenton (1998, p. 2092-2098) for more details). Therefore, it is not advisable to use the MME for the proposed Model II, and one should prefer the maximum likelihood estimates provided below.

Using the relations between the cell probabilities $\{p_{ijk}\}$ and $\theta_{2}=(N_{A},N_{B},\alpha_{0},p_{1},p_{2A},p_{2B})$ , as provided in Section 3, the likelihood function of $\theta_{2}$ is given by

[TABLE]

where $\underline{\textbf{x}}_{k}=\left(x_{11k},x_{10k},x_{01k}\right)$, $x_{0k}=x_{11k}+x_{10k}+x_{01k}$, for $k=A,B$. Since an explicit solution for the MLE of $\theta_{2}$ cannot be obtained, the same computational strategy is followed here as in the case of Model I. As remarked in Subsection 4.1, here also one can consider the reparameterization $N_{B}=r^{-1}N_{A}$ in the likelihood function (8), if the ratio of the sub-population sizes $r$ is known.

5 Simulation Study

In this section, the performance of the proposed estimators is investigated through a simulation study and compared with that of the existing competitors. For this purpose, we consider six trial populations, denoted by $P1,\ldots,P6$, with the choices of capture probabilities $\left(p_{1\cdot k},p_{\cdot 1k}\right)=\left\{(0.60,0.80),(0.60,0.70),(0.80,0.55),(0.80,0.70),(0.50,0.75),(0.50,0.60)\right\}$, respectively, for $k=\{A,B\}$, with $(N_{A},N_{B})=(240,200)$ and $(1200,1000)$. We present the simulation study in two parts. Firstly, we consider the case where the ratio of the sub-population sizes $r$ ($=N_{A}/N_{B}$) is unknown and compare the performance of our proposed estimators with Nour's (1982) estimator given by (2). As discussed before, the estimators (3) and (4) proposed by Wolter (1990) are not applicable here. Secondly, we consider the case where $r$ is known, and Wolter's (1990) estimators are compared with the proposed estimators. It is important to note that Nour's (1982) method is unable to incorporate the knowledge on $r$.

First, we generate 1000 data sets from Model I for each of the six said trial populations $P1-P6$ with $\alpha_{A}=0.4,0.8$ . As the LP estimator of $N_{B}$ produces efficient results under the causal independence assumption (S3) for large or moderately large samples, our primary interest in Model I lies in the estimate of $N_{A}$ based on MME and MLE. Final estimate of $N_{A}$ is obtained by averaging the estimates over $1000$ replications. To compare the performance of the estimators, we compute relative bias (RB) and relative root mean square error (RRMSE) using the following formula:

[TABLE]
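The two performance measures can be computed from the replicated estimates as sketched below. The formulas of the paper are not reproduced in this extract, so the standard definitions are assumed: RB is the mean estimate minus the true size, divided by the true size, and RRMSE is the root mean square error divided by the true size. The function names are hypothetical.

```python
import math


def relative_bias(estimates, true_n):
    """(mean of the estimates - true_n) / true_n (assumed standard definition)."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_n) / true_n


def relative_rmse(estimates, true_n):
    """sqrt(mean squared error) / true_n (assumed standard definition)."""
    mse = sum((e - true_n) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_n


replicates = [231.0, 252.5, 244.0, 238.2]   # made-up values, for illustration only
print(relative_bias(replicates, 240), relative_rmse(replicates, 240))
```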

In the capture-recapture setting, point estimators of population size commonly possess positively skewed distributions (Yang and Pal, 2010). Therefore, we obtain the $95\%$ confidence interval (C.I.) for $N_{A}$ based on the log-transformation method discussed in Chao et al. (1987) and Yang and Chao (2005). In this method, $\log(\hat{N}_{A}-x_{0A})$ is treated as an approximately normal variate, which gives the $95\%$ confidence interval as

[TABLE]

where $C=\exp\left\{1.96\left[\log\left(1+\hat{\sigma}_{\hat{N}_{A}}^{2}/(\hat{N}_{A}-x_{0A})^{2}\right)\right]^{1/2}\right\}$, and $\hat{\sigma}_{\hat{N}_{A}}^{2}$ is the estimate of the variance of $\hat{N}_{A}$. For each of the 1000 replications, $\hat{\sigma}_{\hat{N}_{A}}$ is computed using the parametric bootstrap method based on 1000 bootstrap samples. The length of the $95\%$ confidence interval (LCI) as well as its coverage probability (CP) are computed following the methods discussed in Yang and Pal (2010) and Chatterjee and Mukherjee (2018). We first compare the CPs of the estimators to see which one performs best, and then compare the LCIs when the CPs are found to be similar (Yang and Pal, 2010). Note that Nour's (1982) estimator is not model-based, and the corresponding CP and LCI cannot be obtained by the aforementioned parametric bootstrap method. The results are presented in Tables 2 and 3 for true population size $N_{A}=240$ and $1200$, respectively.
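The log-transformation interval can be sketched as follows. The interval itself is not reproduced in this extract, so the usual form $\left[x_{0A}+(\hat{N}_{A}-x_{0A})/C,\;x_{0A}+(\hat{N}_{A}-x_{0A})C\right]$ is assumed, with $C$ as defined above. The function name and arguments are hypothetical, and the variance is supplied by the caller (for example from a parametric bootstrap).

```python
import math


def log_transform_ci(n_hat, x0, var_n_hat, z=1.96):
    """Log-transformation confidence interval for a population size.

    Assumed form: [x0 + (n_hat - x0) / C, x0 + (n_hat - x0) * C], where
    C = exp(z * sqrt(log(1 + var_n_hat / (n_hat - x0)^2))).
    """
    d = n_hat - x0                       # estimated number of uncaptured individuals
    if d <= 0:
        raise ValueError("n_hat must exceed the number of captured individuals x0")
    c = math.exp(z * math.sqrt(math.log(1.0 + var_n_hat / d ** 2)))
    return x0 + d / c, x0 + d * c


lower, upper = log_transform_ci(n_hat=1000.0, x0=768, var_n_hat=400.0)
print(lower, upper)
```

By construction the lower limit never falls below $x_{0A}$, the number of individuals actually observed, and the interval is asymmetric about $\hat{N}_{A}$, which suits a positively skewed estimator.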

From Table 2, it is observed that both the proposed estimators (MME and MLE) of $N_{A}$ outperform Nour's (1982) estimator in terms of RB and RRMSE. One can also observe that the RB and RRMSE of the MLE are smaller than those of the MME. Interestingly, the performances of the MLE and MME are comparable with respect to CP and LCI. As expected, both RB and RRMSE of the proposed estimators decrease as the population size $N_{A}$ increases.

Next, we generate data from Model II considering the same trial populations $P1-P6$ along with the common dependence parameter $\alpha_{0}=0.4,0.8$. Similar to the case of Model I, we obtain RB, RRMSE, CP, and LCI for the estimators of both $N_{A}$ and $N_{B}$, and the results are presented in Tables 4 and 5. As discussed before, the proposed MME from Model II is often found to be negative; hence, this estimator has not been considered in this simulation study. It is clear from the results presented in Table 4 that the performance of the proposed MLE under Model II is significantly better than that of Nour (1982), both in terms of RB and RRMSE. Nour's (1982) estimator underestimates $N_{A}$ and $N_{B}$, whereas the biases incurred by our proposed MLE are negligible for both the sub-population sizes. The results from Table 4 indicate that the interval estimates based on the proposed MLE perform efficiently both in terms of CP and LCI. As expected, the RB and RRMSE of the MLE decrease as the population sizes $N_{A}$ and $N_{B}$ increase.

As mentioned in Remark 4, information on the ratio of the sub-population sizes $r$, if available, can be incorporated in our proposed likelihood-based estimate. It is important to note that an estimate of $r$ may be available for a large population based on previous studies (Wolter, 1990). For example, in a census coverage study, an estimate of the sex-ratio may be available from a past demographic analysis of the population under consideration (Robinson et al., 1993). Therefore, assuming $r$ to be known, we present this analysis only for the large populations, that is, for $(N_{A},N_{B})=(1200,1000)$. The performance of our proposed estimator under Model I (Model II) is compared with the estimator of Wolter-2 (Wolter-1), and the results are presented in Table 6 (Table 7). It is clearly seen that the proposed estimator is superior to Wolter's estimators with respect to RB and RRMSE. Moreover, the 95% CIs from our models have far better CPs than those from the models proposed by Wolter (1990), for both choices of $\alpha_{A}$ or $\alpha_{0}$. The CIs resulting from Wolter-2 are shorter than those of our Model I; however, Wolter-1 exhibits much wider confidence intervals than the proposed Model II. Similar results are also observed (not reported here) for $(N_{A},N_{B})=(240,200)$.

6 Applications

In this section, we first analyze a data set on Encephalitis (infectious and noninfectious) incidence in England during November 2006 to October 2007 (Granerod et al., 2013), presented in the top panel of Table 8. These data were collected adhering to an encephalitis code in any of the 20 diagnostic fields, and segregated into two strata, Children ($<18$ years) and Adult ($\geq 18$ years). A patient detected with encephalitis by a hospital clinician was likely to be recorded in Hospital Episode Statistics (HES) and also included in the Public Health England (PHE) study. Thus, Granerod et al. (2013, p. 1461) anticipated that the two sources are likely to be positively dependent. As a result, the LP estimator, given in (1), probably underestimates the true number of cases. Note that the estimator proposed by Nour (1982) cannot be applied to both the strata, as its underlying condition ($x_{11}^{2}>x_{10}x_{01}$) is not valid. Also, the estimators proposed by Wolter (1990) cannot be applied, as the ratio of adult and child patients (equivalent to the sex-ratio for male-female stratification) is not available here. Therefore, we compare the results from our proposed models with those of the LP estimator defined in (1).

As remarked in Subsection 4.1, the MMEs are approximately equal to the MLEs under Model I, and hence we only consider the MLE for our data analysis. For analyzing the data under Model I, we consider two cases separately: one in which the capture-recapture status is assumed independent for Children, and one in which it is assumed independent for Adult. In order to compute the estimate of the standard error $\hat{\sigma}_{\hat{N}}$, we use the same parametric bootstrap method as mentioned in Section 5. Comparing the relative standard error (RSE), i.e. $\hat{\sigma}_{\hat{N}}/\hat{N}$, we find that our proposed estimator under Model I performs better with the independence assumption for Children than with that for Adult, and the corresponding results are reported in the top panel of Table 9. The estimate of the dependence parameter indicates that $5\%$ of the Adult encephalitis patients are causally dependent. Under Model II, the estimated number of patients is larger than that under Model I. The estimated proportion of causally dependent patients is $3\%$ for both Adult and Children under Model II. It is interesting to note that the RSE, based on 1000 bootstrap samples, of the MLE under Model II (Model I) is substantially smaller than that of the MLE under Model I ($M_{t}$ model).

Now we consider another dual system dataset (see bottom panel of Table 8) from the Wagai and Yala divisions in western Kenya on child mortality, named Gem in the article by Eisele et al. (2003). This study concerns the completeness and differential ascertainment of vital events related to child health among male and female children (less than five years old) registered in a demographic surveillance system (DSS), based on a two-sample capture-recapture experiment. Here also, the methods proposed by Wolter (1990) and Nour (1982) are not applicable, for the same reasons mentioned earlier. Analyzing the data, we find that our proposed estimators $\hat{N}_{k}$ under Model I, for $k=\text{Male, Female}$, perform better with the assumption that the capture-recapture statuses for Female are independent. The results are presented in the bottom panel of Table 9. It is seen that the estimates for female deaths based on Model I and the $M_{t}$ model are very close; however, the RSE for Model I is smaller than that of the $M_{t}$ model. The estimate of the dependence parameter indicates that $7\%$ of male children are causally dependent under Model I. Under Model II, the MLEs are marginally lower than those under Model I. In this case the RSE of the estimates under Model I (Model II) is smaller than that under Model II ($M_{t}$ model). Based on our analysis, no evidence of list-dependence was found in the DRS under consideration, which supports the argument made by Eisele et al. (2003).

7 Concluding Remarks

This article deals with the problem of population size estimation when the causal independence assumption in DRS is not valid. We introduce a model, called the Bivariate Bernoulli model, that accounts for the possible dependence between capture and recapture attempts. Though the proposed model addresses positive correlation, one can easily rewrite the model in order to incorporate negative dependence (see Remark 1). Our proposed model seems to have an edge in terms of ease of interpretation and has a much wider domain of applicability. When the ratio of the sub-population sizes (e.g., sex-ratio for male-female stratification) is known, estimates based on our proposed models may be preferred. The model also allows inclusion of any additional information (e.g. sex-ratio), if available, to make more efficient inference. Although the primary objective of this article is to obtain an efficient estimate of the population size $N$, the estimates of the other model parameters, especially $\hat{\alpha}$, give specific insights into the capture-recapture mechanism. The BBM can also be extended to multiple-list or multiple capture-recapture problems, which are commonly encountered in the study of wildlife populations. It is also an interesting problem to develop a procedure for testing the behavioral dependence between two sources in DRS, which will be taken up in future work.

Appendix

**Derivation for MME under Model I:**

We get from (6), the following equation in terms of $p_{2A}$ , $\alpha_{A}$ , and $N_{A}$ :

[TABLE]

where $\hat{p}_{1}=\hat{p}_{1\cdot A}=\frac{x_{11B}}{x_{\cdot 1B}}$ . Now, by adding (9) and (10), we get the MME of $N_{A}$ as

[TABLE]

Again, by adding the equations (9)-(11),

[TABLE]

and by subtracting (11) from (10), we get

[TABLE]

Now, using the estimates $\hat{N}_{A}$ and $\hat{p}_{1}$ in (12), we get

[TABLE]

Since $N_{A}\hat{p}_{1}=x_{1\cdot A}$ , (9) implies

[TABLE]

Subtracting (13) from (14), the MME of $\alpha_{A}$ is obtained as

[TABLE]

Using $\hat{\alpha}_{A}$ in (13), the MME of $p_{2A}$ is given as

[TABLE]

In order to ensure that MME of $\alpha_{A}$ lies in $[0,1]$ , we modify (15) and consider

[TABLE]

$\Box$

**Derivation for MME under Model II:** We get from (6), the following equation in terms of $p_{1A}$ , $p_{2A}$ , $\alpha_{A}$ , and $N_{A}$ :

[TABLE]

Now, dividing (16) by (17) we get

[TABLE]

Next, we equate the expected and observed number of cell counts from the 2 $\times$ 2 table obtained under DRS for the sub-population $U_{B}$ and get

[TABLE]

Now, we consider the assumption (S4) and $\alpha_{A}=\alpha_{B}=\alpha_{0}$ . Therefore, dividing (16) by (19) we get

[TABLE]

since $N_{A}=\frac{x_{1\cdot A}}{x_{1\cdot B}}N_{B}$ .

Similarly, dividing (17) by (20) we get

[TABLE]

From equations (21) and (22), we get

[TABLE]

and

[TABLE]

Therefore, by putting the above estimate $\hat{p}_{2A}$ in equations (16) and (18), we get

[TABLE]

and

[TABLE]

respectively. Finally, we obtain the estimates of the sub-population sizes as

[TABLE]

since $N_{A}p_{1A}=x_{1\cdot A}$ and $N_{B}p_{1B}=x_{1\cdot B}$ .

Bibliography

Bell, W. R. (1993). Using information from demographic analysis in post-enumeration survey (PES) estimation. Journal of the American Statistical Association, 88:1106–1118.
Bohning, D. and Heijden, P. V. D. (2009). Recent developments in life and social science applications of capture–recapture methods. Advanced Statistical Analysis, 93:1–3.
Bowman, K. O. and Shenton, L. R. (1998). Encyclopedia of Statistical Sciences. John Wiley & Sons.
Chandra Sekar, C. and Deming, W. E. (1949). On a method of estimating birth and death rates and the extent of registration. Journal of the American Statistical Association, 44:101–115.
Chao, A., Chu, W., and Chiu, H. H. (2000). Capture-recapture when time and behavioral response affect capture probabilities. Biometrics, 56:427–433.
Chao, A., Tsay, P. K., Lin, S.-H., Shau, W.-Y., and Chao, D.-Y. (1987). Estimating the population size for capture-recapture data with unequal catchability. Biometrics, 43:783–791.
Chao, A., Tsay, P. K., Lin, S.-H., Shau, W.-Y., and Chao, D.-Y. (2001). Tutorial in biostatistics: The applications of capture-recapture models to epidemiological data. Statistics in Medicine, 20:3123–3157.
Chatterjee, K. (2015). Comment on Yang and Chao (2005), on the identifiability of model $M_{tb}$ for two sample capture-recapture experiments. DOI: 10.13140/RG.2.1.4580.1685.
