Integrating endogeneity in survey sampling using instrumental-variable calibration estimator
Muhammad Nadeem Intizar, Muhammad Ahmed Shehzad, Haris Khurram, Soofia Iftikhar, Aamna Khan, Abdul Rauf Kashif

TL;DR
This paper introduces instrumental-variable calibration estimators for survey sampling when some auxiliary variables are endogenous (correlated with the model error), with the aim of reducing bias and mean square error relative to conventional calibration.
Contribution
The novelty lies in proposing instrumental-variable calibrated estimators that outperform conventional methods in the presence of endogenous auxiliary variables.
Findings
Instrumental-variable calibration estimators reduce bias and variance in survey sampling with endogenous variables.
Simulation and real data examples confirm the improved performance of the proposed estimators.
The method is more efficient than traditional calibration when auxiliary variables are endogenous.
Abstract
The endogeneity problem arises when the auxiliary variables are correlated with the error terms. In such cases, appropriate instrumental variables ensure efficient estimation. Calibration has established itself as an important and widely used methodological tool for estimating the population total in survey sampling. However, it does not offer efficient estimation in the presence of endogeneity. When endogeneity is present in the auxiliary variables, calibration on the endogenous auxiliary variables may produce bias and increased variance because of inappropriate model assumptions. In this article, we propose instrumental-variable calibrated estimators, using the classical instrumental-variables approach for the case of exact identification, that are more efficient than conventional calibration estimators when some auxiliary variables are endogenous. The necessary properties of the proposed…
1. Introduction
Estimating a population total or mean is a central task in the analysis of survey data. Various researchers have proposed estimators of the population total and mean under different sampling designs and for different problems arising in survey data. Liu and Arslan [1] proposed estimators of the population mean using auxiliary proportions. Ahmad et al. [2] suggested generalized estimators of the population mean. Wang et al. [3] derived estimators of the population mean under simple and double sampling in the presence of extreme values. The calibration technique was introduced by Deville and Särndal [4] to obtain an estimator of the population total using sample weights called calibrated weights. These weights are obtained by minimizing their distance to the Horvitz-Thompson weights subject to the calibration equations. The resulting weights are a function of the auxiliary variables.
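Suppose we wish to estimate the total $t_y = \sum_{k \in U} y_k$ of the variable of interest $y$ in a finite population $U = \{1, \dots, N\}$. A probability sample $s$ is selected from the population with sampling design $p(s)$, and $y_k$ is the value of the study variable for the $k$-th unit, observed for all $k \in s$ (complete response), with a known inclusion probability $\pi_k$ for each element $k$ and the corresponding sampling design weight $d_k = 1/\pi_k$. The vector of $p$ auxiliary variables $\mathbf{x}_k = (x_{k1}, \dots, x_{kp})^{T}$ is the transposed vector whose elements are the values of the auxiliary variables for the unit associated with $y_k$. We observe $(y_k, \mathbf{x}_k)$ for the elements $k \in s$. The population total of $\mathbf{x}$, $\mathbf{t}_x = \sum_{k \in U} \mathbf{x}_k$, is known, and the Horvitz-Thompson estimator is $\hat{t}_{y\pi} = \sum_{k \in s} d_k y_k$. Deville and Särndal [4] suggested the calibration estimator defined in equation (1.1) as
$$\hat{t}_{yw} = \sum_{k \in s} w_k y_k, \tag{1.1}$$
where the weights $w_k$ are selected to satisfy
$$\sum_{k \in s} w_k \mathbf{x}_k = \mathbf{t}_x. \tag{1.2}$$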
To obtain the calibrated weights $w_k$, their distance to the design weights $d_k$ is minimized; any distance function suggested by Deville and Särndal [4] can be used under some basic conditions, subject to the constraints given in eq. (1.2). Thus, calibration weights are functions of the design weights and the available auxiliary information. With a vector of Lagrange multipliers, the constrained minimization can be written as a Lagrangian.
The weights and the Lagrange multipliers can then be found by Newton's method, as in equation (1.3):
so the value of is
Hence we get the calibrated weights in equation (1.4) as:
where .
The calibrated weights differ according to the distance function used, and Deville and Särndal [4] suggested several distance functions. The chi-square distance function gives the class of calibrated weights in equation (1.5), in which $q_k$ is a parameter that can be chosen to improve the calibrated weights and the relative efficiency. Estevao and Särndal [5] used an arbitrary positive value of $q_k$ to improve the calibrated estimator. The resulting estimator, given in equation (1.6), is the same as the generalized regression (GREG) estimator proposed by Cassel et al. [6], and it can be derived as either a model-based or a design-based estimator (Cardot et al. [7]).
However, this minimum distance technique in calibration offers almost identical estimators for different distance functions. To study the properties of calibrated estimators, Estevao and Särndal [5] suggested calibration estimators under two-phase sampling. Shehzad [8] and Goga and Shehzad [9] produced penalized calibrated estimators. Shehzad et al. [10] and Brirah et al. [11] proposed modified calibration methods for estimating the population total. Alam and Hanif [12] proposed cosmetic calibration estimators. Kott [13], Kott [14], Särndal [15], and Kim (2010) also used the calibration technique under different conditions to derive calibrated estimators. Park and Kim [16] proposed model-based instrumental-variable calibrated estimators that minimize the anticipated variance of the calibration estimator; these were also used under two-phase sampling.
Endogeneity is a classical problem that arises from correlation between the independent variables and the error terms. Wooldridge [17] suggested using instrumental variables $\mathbf{Z}$ that are highly correlated with each endogenous component of $\mathbf{X}$ but independent of the error term $\boldsymbol{\varepsilon}$ to deal with the problem of endogeneity. In survey data, the problem of endogeneity also arises when we model the data to estimate the population total. When endogeneity is present in the auxiliary variables, calibration on the endogenous auxiliary variables may produce bias and increased variance because of inappropriate model assumptions. This estimation problem has not been addressed in calibration estimation.
In this paper, we propose instrumental-variable calibration estimators using model-assisted and model-based approaches when some auxiliary variables are endogenous. The mathematical properties of the proposed estimators are derived, and their performance is evaluated using a simulation study and real data. Sections 3 and 4 present the proposed estimators and their properties. Section 5 evaluates the performance of the estimators through a simulation study and a real data example.
2. Instrumental variables (IV) regression
One of the most important assumptions of the Classical Linear Regression Model (CLRM) is that the regressors are exogenous. The violation of this assumption, that is, correlation between the regressors and the error term, is called endogeneity. The standard remedy is the method of instrumental variables (IV). An estimator for which the number of instrumental variables equals the number of endogenous variables is referred to as just or exactly identified. An estimator for which there are more instrumental variables than endogenous variables is called over-identified [18]. Wright [19] first introduced instrumental variables and used them to estimate supply and demand elasticities for butter and flaxseed. Reiersøl [20] applied the same method in the context of errors-in-variables models in his dissertation. Let $\mathbf{X}$ be a matrix of known regressors and suppose the following superpopulation regression model:
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{2.1}$$
Here $\mathbf{y}$ is the vector of the dependent variable, $\mathbf{X}$ is a non-random, full-rank matrix of independent variables, $\boldsymbol{\beta}$ is the vector of unknown parameters, and $\boldsymbol{\varepsilon}$ is the vector of errors. It is also assumed that the errors have expected value zero, $E(\boldsymbol{\varepsilon}) = \mathbf{0}$, are uncorrelated, and have constant variance (homoscedasticity), i.e. $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 \mathbf{I}$, and that $\mathbf{X}$ and $\boldsymbol{\varepsilon}$ are uncorrelated, i.e. $E(\mathbf{X}^{T}\boldsymbol{\varepsilon}) = \mathbf{0}$, which means that the explanatory variables are exogenous. Then the ordinary least squares (OLS) estimator is
$$\hat{\boldsymbol{\beta}}_{OLS} = (\mathbf{X}^{T}\mathbf{X})^{-1}\mathbf{X}^{T}\mathbf{y}.$$
The ordinary least squares estimator is unbiased, $E(\hat{\boldsymbol{\beta}}_{OLS}) = \boldsymbol{\beta}$, and has minimum variance among linear unbiased estimators,
$$\operatorname{Var}(\hat{\boldsymbol{\beta}}_{OLS}) = \sigma^2 (\mathbf{X}^{T}\mathbf{X})^{-1}.$$
Hence $\hat{\boldsymbol{\beta}}_{OLS}$ is an unbiased and consistent estimator of $\boldsymbol{\beta}$. On the other hand, when $\mathbf{X}$ and $\boldsymbol{\varepsilon}$ are correlated, that is $E(\mathbf{X}^{T}\boldsymbol{\varepsilon}) \neq \mathbf{0}$, the explanatory variables are endogenous and the OLS estimator is biased and inconsistent. In this situation, the estimates may still be used to predict the value of the dependent variable given the value of $\mathbf{X}$. However, they do not recover the causal effect of $\mathbf{X}$ on $\mathbf{y}$. So, to estimate the parameter $\boldsymbol{\beta}$, consider a set of variables $\mathbf{Z}$ (instrumental variables) that are highly correlated with each endogenous component of $\mathbf{X}$ but independent of $\boldsymbol{\varepsilon}$ [17]. If the relationship between each endogenous component of $\mathbf{X}$ and the instrument $\mathbf{Z}$ is as defined in equation (2.2), then, in the exactly identified case, the instrumental variable (IV) estimator is
$$\hat{\boldsymbol{\beta}}_{IV} = (\mathbf{Z}^{T}\mathbf{X})^{-1}\mathbf{Z}^{T}\mathbf{y}. \tag{2.3}$$
The instrumental-variable estimator in equation (2.3) is consistent under certain regularity conditions, although it is generally biased in finite samples.
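The contrast between OLS and the exactly identified IV estimator can be illustrated numerically. The following is a minimal sketch, not the authors' code: the data-generating coefficients (0.8, 0.6), the sample size, and the function names `ols` and `iv_exact` are assumptions chosen for illustration, and Python with NumPy is used here although the paper's computations were done in R.

```python
import numpy as np

def ols(X, y):
    """OLS estimator: solves (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def iv_exact(X, Z, y):
    """Exactly identified IV estimator: solves (Z'X) b = Z'y."""
    return np.linalg.solve(Z.T @ X, Z.T @ y)

# Hypothetical data: x is endogenous because it shares the error eps with y.
rng = np.random.default_rng(0)
n = 20_000
z = rng.normal(size=n)                          # instrument, independent of eps
eps = rng.normal(size=n)                        # structural error
x = 0.8 * z + 0.6 * eps + rng.normal(size=n)    # endogenous regressor
y = 1.0 * x + eps                               # true slope is 1

X = x[:, None]
Z = z[:, None]
beta_ols = ols(X, y)[0]        # pulled away from 1 by Cov(x, eps) > 0
beta_iv = iv_exact(X, Z, y)[0] # close to 1 in large samples
print(f"OLS slope: {beta_ols:.3f}, IV slope: {beta_iv:.3f}")
```

In this setup the OLS slope settles well above the true value of 1, while the IV slope stays close to it, which is the behaviour the consistency argument predicts.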
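3. Instrumental-variable calibration approach
The calibration approach is usually used without assuming a superpopulation model [4]. The calibration technique consists of estimating the population total by
$$\hat{t}_{yw} = \sum_{k \in s} w_k y_k$$
with the constraint in equation (1.2), i.e.
$$\sum_{k \in s} w_k \mathbf{x}_k = \mathbf{t}_x .$$
The distance function (chi-square distance) is
$$\sum_{k \in s} \frac{(w_k - d_k)^2}{2 d_k q_k},$$
where $d_k = 1/\pi_k$ is the design weight and $q_k > 0$. Then the Lagrangian is
$$L = \sum_{k \in s} \frac{(w_k - d_k)^2}{2 d_k q_k} - \boldsymbol{\lambda}^{T}\Big(\sum_{k \in s} w_k \mathbf{x}_k - \mathbf{t}_x\Big). \tag{3.1}$$
Taking derivatives of $L$ in equation (3.1) with respect to $w_k$ and setting them to zero gives $w_k = d_k(1 + q_k \mathbf{x}_k^{T}\boldsymbol{\lambda})$; substituting into the constraint gives $\boldsymbol{\lambda} = \big(\sum_{k \in s} d_k q_k \mathbf{x}_k \mathbf{x}_k^{T}\big)^{-1}(\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi})$, with $\hat{\mathbf{t}}_{x\pi} = \sum_{k \in s} d_k \mathbf{x}_k$. Putting in the value of $\boldsymbol{\lambda}$, we finally get the weights as:
$$w_k = d_k\Big[1 + q_k (\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi})^{T}\Big(\sum_{k \in s} d_k q_k \mathbf{x}_k \mathbf{x}_k^{T}\Big)^{-1}\mathbf{x}_k\Big], \tag{3.2}$$
hence the calibration estimator of $t_y$ using equation (3.2) becomes
$$\hat{t}_{yw} = \hat{t}_{y\pi} + (\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi})^{T}\hat{\mathbf{B}}, \qquad \hat{\mathbf{B}} = \Big(\sum_{k \in s} d_k q_k \mathbf{x}_k \mathbf{x}_k^{T}\Big)^{-1}\sum_{k \in s} d_k q_k \mathbf{x}_k y_k .$$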
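The chi-square calibration step can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: it uses the standard Deville and Särndal closed form $w_k = d_k(1 + q_k \mathbf{x}_k^{T}\boldsymbol{\lambda})$, and the function name `chi2_calibration_weights`, the toy population, and the choice $q_k = 1$ are all hypothetical.

```python
import numpy as np

def chi2_calibration_weights(d, X, t_x, q=None):
    """Chi-square calibration weights w_k = d_k * (1 + q_k * x_k' lam),
    where lam solves (sum_k d_k q_k x_k x_k') lam = t_x - sum_k d_k x_k."""
    q = np.ones_like(d) if q is None else q
    t_x_ht = X.T @ d                        # HT estimate of the auxiliary totals
    M = (X * (d * q)[:, None]).T @ X        # sum_k d_k q_k x_k x_k'
    lam = np.linalg.solve(M, t_x - t_x_ht)
    return d * (1.0 + q * (X @ lam))

# Hypothetical population with an intercept and one auxiliary variable.
rng = np.random.default_rng(1)
N, n = 1000, 100
X_pop = np.column_stack([np.ones(N), rng.normal(5.0, 1.0, size=N)])
y_pop = 2.0 + 3.0 * X_pop[:, 1] + rng.normal(size=N)
t_x = X_pop.sum(axis=0)                     # known population totals

s = rng.choice(N, size=n, replace=False)    # SRSWOR sample
d = np.full(n, N / n)                       # design weights 1 / pi_k
w = chi2_calibration_weights(d, X_pop[s], t_x)

t_ht = d @ y_pop[s]                         # Horvitz-Thompson estimate
t_cal = w @ y_pop[s]                        # calibration (GREG) estimate
print(t_ht, t_cal, y_pop.sum())
```

By construction the weighted sample totals of the auxiliary variables reproduce the known population totals, which is the calibration constraint.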
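Following the calibration approach of Ref. [5], which does not rely on minimizing a distance function, we propose the instrumental-variable calibration estimator
$$\hat{t}_{y,\mathrm{IVC}} = \sum_{k \in s} w_k y_k,$$
where $w_k$ is the calibrated weight obtained by the instrumental-variable approach subject to
$$\sum_{k \in s} w_k \mathbf{x}_k = \mathbf{t}_x .$$
The weight with unknown $\boldsymbol{\lambda}$ is
$$w_k = d_k\big(1 + q_k \boldsymbol{\lambda}^{T}\mathbf{z}_k\big), \tag{3.3}$$
where $q_k$ is positive (in the present study, we take $q_k = 1$), and $\mathbf{z}_k$ is the sample restriction of $\mathbf{Z}$, the classical instrumental variable used instead of the endogenous auxiliary variable. By plugging the weights into the calibration constraint we find the value of $\boldsymbol{\lambda}$ as
$$\boldsymbol{\lambda} = \Big(\sum_{k \in s} d_k q_k \mathbf{x}_k \mathbf{z}_k^{T}\Big)^{-1}(\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi}), \qquad \hat{\mathbf{t}}_{x\pi} = \sum_{k \in s} d_k \mathbf{x}_k .$$
Putting the value of $\boldsymbol{\lambda}$ into the weights in equation (3.3), we finally get the required weights as
$$w_k = d_k\Big[1 + q_k (\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi})^{T}\Big(\sum_{k \in s} d_k q_k \mathbf{z}_k \mathbf{x}_k^{T}\Big)^{-1}\mathbf{z}_k\Big], \tag{3.4}$$
so the instrumental-variable calibration estimator for the total, using equation (3.4), is:
$$\hat{t}_{y,\mathrm{IVC}} = \hat{t}_{y\pi} + (\mathbf{t}_x - \hat{\mathbf{t}}_{x\pi})^{T}\hat{\mathbf{B}}_{IV}, \tag{3.5}$$
where $\hat{\mathbf{B}}_{IV} = \big(\sum_{k \in s} d_k q_k \mathbf{z}_k \mathbf{x}_k^{T}\big)^{-1}\sum_{k \in s} d_k q_k \mathbf{z}_k y_k$ and $\hat{t}_{y\pi} = \sum_{k \in s} d_k y_k$. The estimator defined in equation (3.5) is a model-assisted (design-based) instrumental-variable estimator.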
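The instrumental-variable calibration weights can be sketched in the same way. This is an illustrative sketch that assumes the instrument-vector form $w_k = d_k(1 + q_k \boldsymbol{\lambda}^{T}\mathbf{z}_k)$ with $q_k = 1$ and exact identification; the function name `iv_calibration_weights` and the toy population are hypothetical and are not taken from the paper.

```python
import numpy as np

def iv_calibration_weights(d, X, Z, t_x, q=None):
    """Instrument-vector calibration weights w_k = d_k * (1 + q_k * z_k' lam).

    lam solves (sum_k d_k q_k x_k z_k') lam = t_x - sum_k d_k x_k, so the
    weights reproduce the known totals of X while Z drives the adjustment.
    """
    q = np.ones_like(d) if q is None else q
    M = (X * (d * q)[:, None]).T @ Z            # sum_k d_k q_k x_k z_k'
    lam = np.linalg.solve(M, t_x - X.T @ d)
    return d * (1.0 + q * (Z @ lam))

# Hypothetical population: x is endogenous, z is its instrument.
rng = np.random.default_rng(2)
N, n = 1000, 100
eps = rng.normal(size=N)
z = rng.normal(size=N)
x = 0.7 * z + 0.7 * eps + 0.5 * rng.normal(size=N)
y = 1.0 + 1.0 * x + eps

X_pop = np.column_stack([np.ones(N), x])        # auxiliaries with known totals
Z_pop = np.column_stack([np.ones(N), z])        # instruments (same dimension)
t_x = X_pop.sum(axis=0)

s = rng.choice(N, size=n, replace=False)        # SRSWOR sample
d = np.full(n, N / n)
w = iv_calibration_weights(d, X_pop[s], Z_pop[s], t_x)
t_ivc = w @ y[s]

# Equivalent regression form: HT estimate plus an IV-coefficient correction.
D = d[:, None]
b_iv = np.linalg.solve((Z_pop[s] * D).T @ X_pop[s], (Z_pop[s] * D).T @ y[s])
t_ivc_reg = d @ y[s] + (t_x - X_pop[s].T @ d) @ b_iv
print(t_ivc, t_ivc_reg)
```

The two computed totals agree up to rounding, which mirrors the algebraic equivalence between the weighted form and the regression form of the estimator.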
3.1. Properties of the model-assisted instrumental-variable calibration estimator
Some properties of the model-assisted instrumental-variable calibration estimator are presented below; their proofs are available in the appendix.
Theorem 1. The model-assisted instrumental-variable calibration estimator is biased, and its bias is given by
where
Theorem 2. The asymptotic variance of the model-assisted instrumental-variable calibration estimator is given by
if then the asymptotic variance of is
4. Model-based instrumental-variable calibration approach
Usually, without auxiliary information, the population total is estimated by the Horvitz-Thompson estimator [21], defined as
$$\hat{t}_{y\pi} = \sum_{k \in s} d_k y_k, \tag{4.1}$$
where $d_k = 1/\pi_k$ is the design weight of unit $k$ in the sample $s$.
The estimator in equation (4.1) may be improved by using the auxiliary variables in the form of model-based estimation. A model identifies the set of conditions that describe a class of distributions [22]. Kumar et al. [23] proposed a model-based calibration estimator for the case in which the study and auxiliary variables are inversely related. We propose a model-based instrumental-variable calibration estimator of the population total, using the instrumental-variable calibration approach proposed by Ref. [5], under the model given in equation (2.1) as:
which does not satisfy the assumption of exogeneity, that is . We propose a model-based instrumental-variable calibration estimator of Y as:
where the calibrated weights are obtained by the instrumental-variable calibration technique, subject to the constraint
Since the auxiliary variable is endogenous, we use an instrumental variable instead of the endogenous auxiliary variable. By using the instrumental-variable calibration approach proposed by Ref. [5], the weights in equation (4.2) become
Substituting these weights into equation (3.1) and solving gives the value of the multiplier $\boldsymbol{\lambda}$. Substituting this value into equation (4.3) gives the final weights:
Thus, the proposed model-based instrumental-variable calibration estimator of the population total using equation (4.4) becomes
which is equivalent to
where . So, the model-based instrumental-variable calibration weights behave similarly to the calibrated weights under certain conditions.
4.1. Properties of the model-based instrumental-variable calibration estimator
Some properties of the model-based instrumental-variable calibration estimator are presented as theorems.
Theorem 3. The model-based instrumental-variable calibration estimator is biased, and its bias is given as
where .
Theorem 4. The Mean Square Error of the model-based instrumental-variable calibration estimator is given as
5. Simulation scheme
In this section, we report empirical results from a Monte Carlo simulation to check the efficiency of the estimators. The simulation study generates a finite population of size $N = 1000$. For this population, 20 variables of size 1000 (so the $\mathbf{X}$ matrix has dimension $1000 \times 20$) were generated from the normal distribution, and some of them were adjusted through a linear function to be correlated with the error terms, which makes them endogenous. The finite population consists of pairs of the study variable and the auxiliary variables, which are linearly related, and the variable of interest is obtained from the model defined in equation (2.1). The value of is taken as 1. The total of the study variable over the population is taken as the true population total. Instrumental variables were also generated from the normal distribution, under the assumption that they are correlated with the auxiliary variables but unrelated to the error terms. For each draw, a sample of size $n$ was taken using Simple Random Sampling without Replacement (SRSWOR). Different numbers of endogenous variables (one, two, and three) were considered, using a linear model so that the error terms are related to the corresponding auxiliary variables. Each endogenous variable was then replaced by an instrumental variable generated to be independent of the error term and correlated with its endogenous auxiliary variable. The simulation was repeated a fixed number of times, and the generated data were kept fixed in each simulation. All the computational work was done in the R language.
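One way to generate a population of this kind is sketched below. This is a hypothetical construction, not the authors' R code: the loadings 0.7, 0.7, and 0.5, the unit regression coefficients, the seed, and the function name `make_population` are assumptions made only so that the first `n_endog` columns are correlated with the error while their instruments are not.

```python
import numpy as np

def make_population(N=1000, p=20, n_endog=1, seed=0):
    """Generate X (N x p), instruments Z, study variable y, and errors eps.

    The first n_endog columns of X share the error eps (endogenous); the
    matching columns of Z are correlated with them but independent of eps.
    Exogenous columns serve as their own instruments (exact identification).
    """
    rng = np.random.default_rng(seed)
    eps = rng.normal(size=N)
    X = rng.normal(size=(N, p))
    Z = X.copy()
    for j in range(n_endog):
        z_j = rng.normal(size=N)
        X[:, j] = 0.7 * z_j + 0.7 * eps + 0.5 * rng.normal(size=N)
        Z[:, j] = z_j
    y = X @ np.ones(p) + eps        # assumed: all regression coefficients equal 1
    return X, Z, y, eps

def corr(a, b):
    """Sample correlation between two vectors."""
    return np.corrcoef(a, b)[0, 1]

X, Z, y, eps = make_population(n_endog=2)
t_y = y.sum()                        # treated as the true population total
print(corr(X[:, 0], eps), corr(Z[:, 0], eps), corr(Z[:, 0], X[:, 0]))
```

The printed correlations give a quick check that the endogenous column is related to the error while its instrument is related to the column but not to the error.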
5.1. Performance evaluation
The performance of the proposed estimators is compared with that of the conventional estimators using the following measures.
Bias, which is calculated for the estimated total as:
Mean Square Error (MSE), which is calculated for the estimated total as:
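A common way to compute these two measures from simulation replicates is sketched below. It assumes the usual Monte Carlo definitions (the average error and the average squared error of the estimated totals around the true total), which may differ in detail from the paper's formulas; the function name `mc_bias_mse`, the toy population, and the number of replicates `R = 500` are hypothetical.

```python
import numpy as np

def mc_bias_mse(estimates, true_total):
    """Monte Carlo bias and MSE of replicate estimates around the true total."""
    err = np.asarray(estimates, dtype=float) - true_total
    return err.mean(), np.mean(err ** 2)

# Small deterministic check of the two formulas.
bias, mse = mc_bias_mse([9.0, 11.0, 10.0, 12.0], true_total=10.0)

# Hypothetical use: HT estimator under SRSWOR on a fixed toy population.
rng = np.random.default_rng(3)
N, n, R = 1000, 100, 500
y_pop = rng.normal(5.0, 1.0, size=N)
t_y = y_pop.sum()
ht_estimates = [
    (N / n) * y_pop[rng.choice(N, size=n, replace=False)].sum()
    for _ in range(R)
]
bias_ht, mse_ht = mc_bias_mse(ht_estimates, t_y)
print(bias_ht, mse_ht)
```

Because the population is held fixed and only the sample changes between replicates, the spread of the replicate estimates reflects the sampling design alone.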
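5.2. Simulation result
The results are presented in Tables 1 to 5. These results show the behaviour of all the considered estimators: the HT estimator, the GREG or conventional calibration estimator, and the Instrumental-Variable Calibration (IVC) estimator, for different endogenous auxiliary variables among 20 total auxiliary variables and for different sample sizes drawn by SRSWOR. In every table, the performance of each estimator is examined with two properties: Bias and Mean Square Error (MSE).

Table 1. Monte Carlo Bias and Mean Square Error (MSE) with one endogenous variable.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 25 | HT | −16.361 | 873393.000 |
| 25 | GREG | −778.672 | 621246.000 |
| 25 | IVC | −778.658 | 621221.000 |
| 50 | HT | −1.190 | 425585.000 |
| 50 | GREG | −382.337 | 156225.000 |
| 50 | IVC | −382.333 | 156224.000 |
| 75 | HT | −4.971 | 257230 |
| 75 | GREG | −245.274 | 65207.400 |
| 75 | IVC | −245.27 | 65204.100 |
| 100 | HT | −2.504 | 190901.000 |
| 100 | GREG | −183.93 | 36814.400 |
| 100 | IVC | −183.93 | 36813.300 |
| 150 | HT | −13.283 | 125723.000 |
| 150 | GREG | −114.260 | 14322.300 |
| 150 | IVC | −114.260 | 14322.200 |
| 200 | HT | −4.841 | 93744.400 |
| 200 | GREG | −80.432 | 7159.710 |
| 200 | IVC | −80.434 | 7159.450 |
| 250 | HT | −6.553 | 71821.200 |
| 250 | GREG | −60.615 | 4079.350 |
| 250 | IVC | −60.618 | 4079.140 |
| 300 | HT | −1.483 | 53143.100 |
| 300 | GREG | −47.477 | 2508.530 |
| 300 | IVC | −47.479 | 2508.430 |
| 350 | HT | −0.832 | 42642.500 |
| 350 | GREG | −37.613 | 1582.320 |
| 350 | IVC | −37.614 | 1582.230 |

Table 2. Monte Carlo Bias and Mean Square Error (MSE) with two endogenous variables.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 25 | HT | −11.045 | 1042904.000 |
| 25 | GREG | −778.750 | 621374.000 |
| 25 | IVC | −778.730 | 621333.000 |
| 50 | HT | −1.195 | 532269.000 |
| 50 | GREG | −382.420 | 156285.000 |
| 50 | IVC | −382.420 | 156285.000 |
| 100 | HT | −4.619 | 237769.000 |
| 100 | GREG | −183.960 | 36833.000 |
| 100 | IVC | −183.960 | 36832.800 |
| 150 | HT | −15.471 | 151008.000 |
| 150 | GREG | −114.260 | 14326.400 |
| 150 | IVC | −114.260 | 14326.100 |
| 200 | HT | −8.883 | 111323.000 |
| 200 | GREG | −80.434 | 7160.840 |
| 200 | IVC | −80.437 | 7160.200 |
| 250 | HT | −6.965 | 82023.700 |
| 250 | GREG | −60.598 | 4078.120 |
| 250 | IVC | −60.599 | 4078.000 |
| 300 | HT | −2.243 | 60966.400 |
| 300 | GREG | −47.473 | 2509.380 |
| 300 | IVC | −47.474 | 2509.330 |
| 350 | HT | −0.406 | 50913.400 |
| 350 | GREG | −37.601 | 1581.900 |
| 350 | IVC | −37.601 | 1581.870 |

Table 3. Monte Carlo Bias and Mean Square Error (MSE) with two endogenous variables.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 50 | HT | −11.489 | 615217.000 |
| 50 | GREG | −383.490 | 156733.000 |
| 50 | IVC | −383.350 | 156646.000 |
| 80 | HT | −2.277 | 358471.000 |
| 80 | GREG | −231.530 | 58098.100 |
| 80 | IVC | −231.390 | 58049.500 |
| 100 | HT | −0.530 | 284893.000 |
| 100 | GREG | −181.740 | 36030.700 |
| 100 | IVC | −181.640 | 36008.700 |
| 150 | HT | −4.183 | 175080.000 |
| 150 | GREG | −114.070 | 14434.700 |
| 150 | IVC | −114.010 | 14429.400 |
| 200 | HT | −0.990 | 129327.000 |
| 200 | GREG | −79.734 | 7123.410 |
| 200 | IVC | −79.701 | 7121.280 |
| 250 | HT | −1.660 | 98590.300 |
| 250 | GREG | −59.882 | 4082.220 |
| 250 | IVC | −59.875 | 4084.980 |
| 300 | HT | −0.985 | 75622.400 |
| 300 | GREG | −46.741 | 2513.170 |
| 300 | IVC | −46.723 | 2513.600 |
| 350 | HT | −2.691 | 59559.900 |
| 350 | GREG | −37.209 | 1610.760 |
| 350 | IVC | −37.199 | 1612.290 |

Table 4. Monte Carlo Bias and Mean Square Error (MSE) with three endogenous variables.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 25 | HT | −7.794 | 1145201.000 |
| 25 | GREG | −778.819 | 621504.800 |
| 25 | IVC | −778.757 | 621387.700 |
| 50 | HT | −10.167 | 588039.300 |
| 50 | GREG | −382.498 | 156348.000 |
| 50 | IVC | −382.494 | 156340.400 |
| 100 | HT | −4.504 | 262701.600 |
| 100 | GREG | −183.999 | 36853.180 |
| 100 | IVC | −183.995 | 36850.250 |
| 150 | HT | −10.167 | 588039.300 |
| 150 | GREG | −382.498 | 156348.000 |
| 150 | IVC | −382.494 | 156340.400 |
| 200 | HT | 10.501 | 124380.400 |
| 200 | GREG | −80.436 | 7162.820 |
| 200 | IVC | −80.438 | 7161.104 |
| 250 | HT | −6.819 | 91988.110 |
| 250 | GREG | −60.582 | 4077.529 |
| 250 | IVC | −60.580 | 4077.112 |
| 300 | HT | 2.155 | 67202.490 |
| 300 | GREG | −47.470 | 2510.695 |
| 300 | IVC | −47.469 | 2510.488 |
| 350 | HT | −1.049 | 56529.500 |
| 350 | GREG | −37.589 | 1581.869 |
| 350 | IVC | −37.589 | 1581.771 |

Table 5. Monte Carlo Bias and Mean Square Error (MSE) with three endogenous variables.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 50 | HT | −13.856 | 562018.300 |
| 50 | GREG | −286.849 | 91032.490 |
| 50 | IVC | −286.932 | 91175.830 |
| 80 | HT | −6.788 | 335671.800 |
| 80 | GREG | −173.478 | 33916.380 |
| 80 | IVC | −173.190 | 33840.500 |
| 100 | HT | −9.501 | 252110.500 |
| 100 | GREG | −133.820 | 20549.370 |
| 100 | IVC | −133.715 | 20528.650 |
| 150 | HT | −9.495 | 166003.000 |
| 150 | GREG | −85.768 | 8706.890 |
| 150 | IVC | −85.575 | 8680.110 |
| 200 | HT | −14.421 | 125141.100 |
| 200 | GREG | −61.105 | 4579.970 |
| 200 | IVC | −60.982 | 4579.366 |
| 250 | HT | −12.313 | 86838.950 |
| 250 | GREG | −45.391 | 2600.937 |
| 250 | IVC | −45.396 | 2602.538 |
| 300 | HT | −6.048 | 71820.870 |
| 300 | GREG | −35.298 | 1611.601 |
| 300 | IVC | −35.347 | 1610.860 |
| 350 | HT | −3.512 | 55557.950 |
| 350 | GREG | −27.849 | 1018.425 |
| 350 | IVC | −27.880 | 1020.369 |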
Table 1 shows the results of HT, GREG, and IVC in terms of Bias and MSE with one endogenous variable. For all the sample sizes considered, the Mean Square Error (MSE) of the proposed Instrumental-Variable Calibrated (IVC) estimator is smaller than that of the HT and GREG estimators, although the difference from GREG is small. Table 2 shows the results obtained under similar conditions for two endogenous variables: the MSE of IVC is smaller than that of HT at every sample size and smaller than or equal to that of GREG (the two coincide at n = 50). Table 3 shows the results for a second setting with two endogenous variables: IVC has the smallest MSE for n = 50 to 200, whereas GREG has a marginally smaller MSE for n = 250, 300, and 350. Tables 4 and 5 show the results for three endogenous variables. In Table 4, the IVC estimator has the smallest MSE at every sample size; in Table 5, it has the smallest MSE at n = 80, 100, 150, 200, and 300, while GREG is marginally smaller at n = 50, 250, and 350. In Tables 1 to 5, the HT estimator has the smallest bias in absolute value but by far the largest MSE.
Overall, the proposed Instrumental-Variable Calibrated (IVC) estimator gives an MSE that is smaller than or close to that of GREG for both small and large sample sizes, and much smaller than that of HT. The IVC estimator therefore improves efficiency over conventional calibration in most of the settings considered, although the gain is modest.
5.3. Real data example
To compare the proposed estimators with the Horvitz-Thompson and conventional calibration (GREG) estimators, we used a real data example. The data given by Singh et al. [24] are used to evaluate the model performance. The data are freely and publicly accessible at: http://www.kiran.nic.in/pdf/Social_Science/elearning/How_to_Test_Endogeneity_or_Exogeneity_using_SAS-1.pdf. The dataset contains eight variables of size N = 376: Min_Tem (Minimum Temperature), Rain (Average Rainfall), Foodgrain_Yield (Yield of food grain), Latitude (Latitude of a particular location), Longitude (Longitude of a particular location), Foodgrain_yld_FD (First difference of Foodgrain_Yield), Min_Tem_FD (First difference of Min_Tem), and Rain_FD (First difference of Rain), where the yield of food grain is the dependent variable. The auxiliary variables are Minimum Temperature and Rain, and the other five variables, Latitude, Longitude, Foodgrain_yld_FD, Min_Tem_FD, and Rain_FD, are selected as instrumental variables. Endogeneity has already been reported for the auxiliary variable, so we use the instrumental variables instead of the endogenous auxiliary variables to evaluate the model performance. We treated these data as the population and took samples of size n = 25, 50, 75, 100, 150, 200, and 250 using SRSWOR.
5.4. Real data results
Table 6 presents the Bias and Mean Square Error (MSE) of the three estimators for different sample sizes. When the auxiliary variable, Minimum Temperature, is endogenous, the variable Longitude of a particular location is used as an instrumental variable. The results show that the proposed Instrumental-Variable Calibrated (IVC) estimator has a smaller Mean Square Error (MSE) than the HT and GREG estimators when endogeneity is present in the dataset in the case of exact identification. This shows that the proposed estimator is more efficient than HT and GREG.

Table 6. Real data Average Bias and Average Mean Square Error (MSE) with one endogenous variable.

| Sample Size | Estimator | Bias | MSE |
|---|---|---|---|
| 25 | HT | −184.310 | 1500018504 |
| 25 | GREG | −2690.505 | 1437499886 |
| 25 | IVC | −2403.427 | 1428790986 |
| 50 | HT | −584.414 | 723613047 |
| 50 | GREG | −105.868 | 664012946 |
| 50 | IVC | −33.187 | 661193157 |
| 75 | HT | −676.414 | 470786518 |
| 75 | GREG | −932.468 | 434993010 |
| 75 | IVC | −915.803 | 434817082 |
| 100 | HT | −113.956 | 340054572 |
| 100 | GREG | −266.031 | 300480833 |
| 100 | IVC | −240.996 | 300443340 |
| 150 | HT | −203.112 | 182342767 |
| 150 | GREG | −220.419 | 165836688 |
| 150 | IVC | −181.663 | 165650642 |
| 200 | HT | −19.471 | 105476083 |
| 200 | GREG | −160.463 | 96909490 |
| 200 | IVC | −149.517 | 96621549 |
| 250 | HT | −162.380 | 60052672 |
| 250 | GREG | −225.584 | 53665612 |
| 250 | IVC | −220.437 | 53623346 |
6. Conclusion
In survey sampling, the calibration constraints play an important role. In this paper, the instrumental-variable calibration technique is used to construct estimators of the population total in the presence of endogeneity. In a Monte Carlo simulation study and a real data example, we examined the performance of the proposed estimator for different sample sizes drawn by simple random sampling without replacement from a finite population. In terms of Mean Square Error (MSE), the proposed Instrumental-Variable Calibrated (IVC) estimator is more efficient than the HT estimator in all the settings considered and more efficient than the GREG estimator in most of them, across different sample sizes and numbers of endogenous variables. The MSE of the proposed estimator decreases as the sample size increases. The present study is limited to exact identification, which means that the number of instrumental variables equals the number of endogenous variables. Further investigation of the over-identification problem is a topic for future research.
CRediT authorship contribution statement
Muhammad Nadeem Intizar: Writing – original draft, Validation, Methodology, Investigation, Formal analysis, Data curation. Muhammad Ahmed Shehzad: Writing – review & editing, Visualization, Validation, Supervision, Software, Resources, Project administration, Methodology, Investigation, Conceptualization. Haris Khurram: Writing – review & editing, Validation, Supervision, Software, Project administration, Methodology, Investigation, Formal analysis. Soofia Iftikhar: Visualization, Methodology, Investigation, Formal analysis. Aamna Khan: Writing – review & editing, Visualization, Validation, Resources, Investigation, Formal analysis, Data curation. Abdul Rauf Kashif: Writing – original draft, Resources, Investigation, Formal analysis.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
- [1] Liu X., Arslan M. A general class of estimators on estimating population mean using the auxiliary proportions under simple and two phase sampling. AIMS Mathematics 6(12), 2021, 13592-13607.
- [2] Ahmad S., Arslan M., Khan A., Shabbir J. A generalized exponential-type estimator for population mean using auxiliary attributes. PLoS One 16(5), 2021, e0246947. doi:10.1371/journal.pone.0246947. PMCID: PMC8118354. PMID: 33983938.
- [3] Wang J., Ahmad S., Arslan M., Lone S.A., Abd Ellah A.H., Aldahlan M.A., Elgarhy M. Estimation of finite population mean using double sampling under probability proportional to size sampling in the presence of extreme values. Heliyon 9(11), 2023, e21418. doi:10.1016/j.heliyon.2023.e21418. PMCID: PMC10598535. PMID: 37885711.
- [4] Deville J.C., Särndal C.E. Calibration estimators in survey sampling. J. Am. Stat. Assoc. 87(418), 1992, 376-382.
- [5] Estevao M.V., Särndal C.E. A functional form approach to calibration. J. Off. Stat. 16(4), 2000, 379-399.
- [6] Cassel C.M., Särndal C.E., Wretman J.H. Some results on generalized difference estimation and generalized regression estimation for finite populations. Biometrika 63, 1976, 615-620.
- [7] Cardot H., Goga C., Shehzad M.A. Calibration and partial calibration on principal components when the number of auxiliary variables is large. Stat. Sin. 27, 2017, 243-260.
- [8] Shehzad M.A. Penalization and data reduction of auxiliary variables in survey sampling. General Mathematics [math.GM]. Université de Bourgogne, 2012. English. NNT: 2012DIJOS010. tel-00812880. https://theses.hal.science/tel-00812880/document
