TL;DR
This paper introduces a new inference method that accurately compares two populations despite sampling biases, maintaining low false positive rates where standard methods fail.
Contribution
The authors develop a bias-resilient inference technique that controls false positives under moderate sampling biases, improving reliability over traditional methods.
Findings
Method performs well on synthetic data
Effective on real biomarker datasets
Reduces false positives under bias
Effect Inference from Two-Group Data with Sampling Bias
Dave Zachariah and Petre Stoica
This work has been partly supported by the Swedish Research Council (VR) under contract 2018-05040.
Abstract
In many applications, different populations are compared using data that are sampled in a biased manner. Under sampling biases, standard methods that estimate the difference between the population means yield unreliable inferences. Here we develop an inference method that is resilient to sampling biases and is able to control the false positive errors under moderate bias levels in contrast to the standard approach. We demonstrate the method using synthetic and real biomarker data.
I Introduction
In many applications of statistical inference, the aim is to compare data from different populations. Specifically, given samples from two groups, collected in two data vectors, the target quantity is often the difference between their means, denoted $\delta$, which we call the effect. For instance, in randomized trials and A/B testing, the data are outcomes from two populations and $\delta$ is the average causal effect of assigning subjects to a test group as compared to a control group [1, 2]. The standard approach is to estimate $\delta$ by the difference between the sample averages of the two groups. Confidence intervals for $\delta$ can be obtained using Welch’s method, which employs an approximating t-distribution [3, 4, 5]. Inferring a nonzero effect, $\delta \neq 0$, is equivalent to detecting that the means of two distributions differ, which is a classical problem in statistical signal processing [6, 7].
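As a point of reference, the standard approach can be sketched in a few lines of Python. This is a minimal illustration and not code from the paper: the function names (`welch_interval`, `t_quantile`, and so on) are hypothetical, and the t-quantile is obtained by numerically integrating the Student t-density with the standard library only, an implementation choice made here to keep the example self-contained. The bisection bound `upper` is an assumed value that is adequate for common significance levels.

```python
import math
from statistics import mean, variance


def t_pdf(x, nu):
    """Density of the Student t-distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps=400):
    """CDF via Simpson's rule on [0, |x|], using the symmetry of the density."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    half = total * h / 3
    return 0.5 + half if x >= 0 else 0.5 - half


def t_quantile(p, nu, upper=200.0):
    """Quantile for 0.5 <= p < 1, found by bisection on the CDF."""
    lo, hi = 0.0, upper
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def welch_interval(y0, y1, alpha=0.05):
    """Difference of sample averages with Welch's approximate t-interval."""
    n0, n1 = len(y0), len(y1)
    v0, v1 = variance(y0) / n0, variance(y1) / n1  # squared standard errors
    estimate = mean(y1) - mean(y0)
    se = math.sqrt(v0 + v1)
    # Welch-Satterthwaite approximation of the degrees of freedom
    nu = (v0 + v1) ** 2 / (v0 ** 2 / (n0 - 1) + v1 ** 2 / (n1 - 1))
    half_width = t_quantile(1 - alpha / 2, nu) * se
    return estimate, (estimate - half_width, estimate + half_width), nu


# Toy data, invented for illustration only
estimate, interval, nu = welch_interval([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

An effect would be reported by this standard approach whenever the returned interval excludes zero; nothing in the construction accounts for non-representative sampling.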
Ideally, the samples from both groups are representative of their target populations. Then the bias of the estimator,
[TABLE]
is zero. However, in nonideal conditions with finite samples this is not the case, e.g., when some units of the intended populations are less likely to be included than others. Under such conditions, the bias magnitude decreases with the sample sizes of the two groups but will nevertheless be nonzero. Sampling biases increase the risk of inferring spurious effects when using standard inference methods.
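The inflation of spurious findings can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes unit-variance Gaussian data, equal group sizes, a fixed shift `bias` in the test-group sample, and a large-sample normal interval in place of Welch's t-interval. All names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def false_positive_rate(bias, n=100, trials=500, alpha=0.05, seed=0):
    """Fraction of trials in which a nonexistent effect is declared nonzero."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        # Both populations have mean 0 (no effect); the test-group sample is
        # shifted by `bias` to mimic non-representative sampling.
        y0 = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y1 = [rng.gauss(bias, 1.0) for _ in range(n)]
        estimate = mean(y1) - mean(y0)
        se = math.sqrt(variance(y0) / n + variance(y1) / n)
        if abs(estimate) > z * se:  # the interval excludes zero
            hits += 1
    return hits / trials


rate_unbiased = false_positive_rate(bias=0.0)
rate_biased = false_positive_rate(bias=0.5)
```

With these assumed settings the shift is roughly 3.5 standard errors of the difference of averages, so the biased rate ends up far above the nominal level while the unbiased rate stays close to it.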
In this paper, we develop an inference method that is resilient to sampling biases. In contrast to the standard approach, the proposed method reduces the risk of reporting spurious effect estimates and is capable of controlling the false positive errors under moderate biases. The method relies on an effect estimator using a fully automatic and data-adaptive regularization. We demonstrate its performance on both synthetic and real data.
Remark 1.
Code for the method can be found at https://github.com/dzachariah/two-groups-data
II Problem formulation
We model the dataset as
[TABLE]
The model based on the Gaussian distribution yields the least favourable distribution for estimating the unknown effect [8]. We model the effect as a random variable, where different ranges of values of $\delta$ have different probabilities. To achieve resilience to sampling biases, we adopt a conservative approach in which nonexistent or negligible effects are considered to be more probable. Specifically, we employ the following model:
[TABLE]
where is an unknown parameter.
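Our aim is to derive a confidence interval $C_{\alpha}(\mathbf{y})$ that contains the unknown $\delta$ with a coverage probability of at least $1-\alpha$. That is,
$$\Pr\{\delta \in C_{\alpha}(\mathbf{y})\} \geq 1-\alpha. \tag{3}$$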
The confidence interval is to be centered on an estimator of $\delta$ and should be resilient to sampling biases. That is, even if the sampling bias is nonzero, the interval must not indicate nonzero effects with a probability greater than $\alpha$. Fig. 1 illustrates the ability of the method proposed below to ensure (3) under a range of biases, provided the bias magnitude does not greatly exceed the dispersion of the sample averages.
We will derive a confidence interval using model (1) and (2), with nuisance parameters
[TABLE]
III Proposed method
Consider the conditional mean of the effect $\delta$ given the data $\mathbf{y}$. Using an estimate of the nuisance parameters, we propose the following effect estimator
[TABLE]
where we introduce a variable that can be interpreted as a signal-to-noise ratio; see [9] for a derivation.
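To illustrate how a signal-to-noise ratio can act as a regularizer, consider a simplified scalar model. This model is an assumption of the sketch and is not the paper's estimator (4): a difference of sample averages `d` is observed with noise variance `noise_var`, and the effect has a zero-mean Gaussian prior with variance `prior_var`. The conditional mean of the effect given `d` is then `snr / (1 + snr) * d`, with `snr = prior_var / noise_var`.

```python
def shrinkage_estimate(d, noise_var, prior_var):
    """Conditional mean of a zero-mean Gaussian effect given d = effect + noise.

    Simplified scalar model, assumed for illustration only:
        effect ~ N(0, prior_var),  d | effect ~ N(effect, noise_var).
    """
    snr = prior_var / noise_var  # signal-to-noise ratio
    return snr / (1.0 + snr) * d


# A vanishing signal-to-noise ratio shrinks the estimate to zero, while a
# large one leaves the difference of averages essentially unchanged.
no_signal = shrinkage_estimate(2.0, noise_var=1.0, prior_var=0.0)
strong_signal = shrinkage_estimate(2.0, noise_var=1.0, prior_var=1e9)
```

In this toy setting the shrinkage factor moves continuously between 0 and 1, which gives a regularized estimate that coincides with the unregularized one when the data are informative.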
Result 1 (Cramér-Rao bound).
When the systematic error of is invariant with respect to , then the mean-squared error over all possible effects and data has a Cramér-Rao bound where
[TABLE]
Proof.
See Appendix -A. ∎
Result 2 (Confidence interval).
Let
[TABLE]
When using an efficient estimator that attains the bound (5), the interval in (6) satisfies the specified coverage probability (3).
Proof.
See Appendix -B. ∎
Evaluating the estimator (4) and the interval (6) requires estimates of the nuisance parameters. Here we adopt the maximum likelihood approach and estimate them using the marginalized data distribution,
[TABLE]
It can be shown that (7) is a Gaussian distribution [9] with mean and covariance
[TABLE]
The estimated parameters are given by
[TABLE]
which can be shown to yield an asymptotically efficient estimator (4) [10, corr. 9].
Interestingly, the problem (8) can be solved by a one-dimensional numerical search. Begin by defining the variables
[TABLE]
Note that . Then the following result holds.
Result 3 (Nuisance parameter estimates).
The estimated variances are given by
[TABLE]
[TABLE]
which are ensured to be nonnegative, and , where
[TABLE]
All variables in (9)-(11) are functions of the overall mean, whose estimate is obtained by minimizing the one-dimensional function
[TABLE]
Proof.
See Appendix -C. ∎
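A one-dimensional search of this kind can be sketched as a coarse grid scan followed by a golden-section refinement. The cost function used below is a hypothetical stand-in, since the concentrated cost (12) is not reproduced here. The search interval, grid size, and tolerance are assumed values, and the refinement presumes that the cost is unimodal within the bracket found by the grid scan.

```python
import math


def minimize_1d(cost, lower, upper, grid=200, tol=1e-6):
    """Coarse grid scan, then golden-section refinement around the best point."""
    step = (upper - lower) / grid
    points = [lower + i * step for i in range(grid + 1)]
    best = min(range(grid + 1), key=lambda i: cost(points[i]))
    # Bracket the minimizer by the neighbours of the best grid point
    a = points[max(best - 1, 0)]
    b = points[min(best + 1, grid)]
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if cost(c) < cost(d):
            b, d = d, c  # minimizer lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d  # minimizer lies in [c, b]
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0


# Stand-in cost with a known minimizer at 2.0 (not the paper's cost function)
mean_estimate = minimize_1d(lambda m: (m - 2.0) ** 2 + 1.0, -5.0, 5.0)
```

The grid scan guards against picking a poor local minimum when the cost is multimodal over the full interval; a denser grid can be used at a proportionally higher cost.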
By plugging the estimated nuisance parameters into (4) and (6), we obtain the effect estimate and the confidence interval, respectively. We note that the overall mean is fitted to the data in a nonstandard manner using (12), which yields a fully automatic and data-adaptive regularization of the effect estimator (4). If the minimizer of (12) yields an estimated signal-to-noise ratio of zero, the method indicates that the data is not sufficiently informative to discriminate any systematic difference from noise. Consequently, the effect estimate collapses to zero and the confidence interval becomes empty, indicating a case in which the effect cannot be reliably inferred.
IV Experimental results
We demonstrate the proposed inference method using both synthetic and real data.
IV-A Synthetic data
We generate two-group data using the model (1) and add a negative bias to the test group, using the setup parameters described in Fig. 1. The adaptive regularization of the effect estimator is illustrated in Fig. 2: when the unknown effect is nonexistent, $\delta = 0$, the estimates are concentrated at zero despite the bias. As the magnitude of the effect exceeds the dispersion of the sample averages, however, the regularized and standard estimators become nearly identical.
We report a significant effect estimate when a nonempty interval excludes the zero effect. Fig. 3 illustrates the ability of the proposed method to control the false positive error probability as increases, in contrast to the standard method. This is achieved while incurring a loss of statistical power that vanishes as the number of samples increases.
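The reporting rule stated above can be written as a small predicate. Representing an empty interval by `None` and a nonempty one by a `(lower, upper)` pair is an assumption made for this sketch, as is the function name.

```python
def is_significant(interval):
    """Report an effect only if the interval is nonempty and excludes zero.

    `interval` is either None (empty interval) or a (lower, upper) pair;
    this representation is assumed for illustration.
    """
    if interval is None:
        return False
    lower, upper = interval
    return lower > 0.0 or upper < 0.0


# Hypothetical intervals: only the first and last exclude zero
flags = [is_significant(c) for c in [(0.2, 1.0), (-0.3, 0.4), None, (-1.0, -0.1)]]
```

An empty interval never leads to a reported effect, so cases where the data are uninformative count as non-findings.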
IV-B Prostate cancer data
We now consider real data from healthy individuals and individuals with prostate cancer [11, 12]. The data contain 6033 different biomarker responses. The inferred effects are shown in Fig. 4. For 6 markers, the effects were found to be significant at the chosen level. By contrast, the standard approach using Welch’s t-intervals flags 478 genes as significant, but these inferences are less reliable under sampling biases.
V Conclusions
We developed a method for inferring effects in two-group data that, unlike the standard approach, is resilient to sampling biases. The method is able to control the false positive errors under moderate bias levels and its performance was demonstrated using both synthetic and real biomarker data.
-A The derivation of the Cramér-Rao bound
The mean-square error can be decomposed as
[TABLE]
where is the conditional mean. Next, define the score function and the information matrix,
[TABLE]
Since the marginal pdf is Gaussian, we can compute the information matrix using the Slepian-Bangs formula [13]. It has a block diagonal form
[TABLE]
where
[TABLE]
and .
Let denote the correlation between the score function and estimation error. Then we have the general bound
[TABLE]
In our case, we obtain
[TABLE]
where the fourth line follows under the constant bias assumption. Inserting this expression for in (17) yields
[TABLE]
This completes the proof.
-B The derivation of the confidence interval
We have that
[TABLE]
Let , then
[TABLE]
Thus $\Pr\{\delta \in C_{\alpha}(\mathbf{y})\} \geq 1-\alpha$ when the estimator is efficient.
-C The derivation of the concentrated cost
Problem (8) can be formulated equivalently as the minimization of:
[TABLE]
The minimizer
[TABLE]
is inserted back to yield a concentrated cost function
[TABLE]
Next, using the Sherman-Morrison and matrix determinant lemmas we can reparametrize as
[TABLE]
where we dropped the subindices for notational convenience.
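For reference, the two lemmas can be stated as follows for an invertible matrix $\mathbf{A}$ and vectors $\mathbf{u}$, $\mathbf{v}$ with $1 + \mathbf{v}^\top\mathbf{A}^{-1}\mathbf{u} \neq 0$. The last two lines specialize them to a scaled identity plus a rank-one term; the symbols $\sigma^2$, $\lambda$, $n$ and the all-ones vector $\mathbf{1}$ are illustrative assumptions and need not match the paper's notation.

```latex
\begin{align*}
(\mathbf{A} + \mathbf{u}\mathbf{v}^\top)^{-1}
  &= \mathbf{A}^{-1}
   - \frac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^\top\mathbf{A}^{-1}}
          {1 + \mathbf{v}^\top\mathbf{A}^{-1}\mathbf{u}}, \\
\det(\mathbf{A} + \mathbf{u}\mathbf{v}^\top)
  &= \big(1 + \mathbf{v}^\top\mathbf{A}^{-1}\mathbf{u}\big)\det(\mathbf{A}), \\
(\sigma^2\mathbf{I}_n + \lambda\mathbf{1}\mathbf{1}^\top)^{-1}
  &= \frac{1}{\sigma^2}\Big(\mathbf{I}_n
   - \frac{\lambda}{\sigma^2 + n\lambda}\,\mathbf{1}\mathbf{1}^\top\Big), \\
\det(\sigma^2\mathbf{I}_n + \lambda\mathbf{1}\mathbf{1}^\top)
  &= \sigma^{2n}\Big(1 + \frac{n\lambda}{\sigma^2}\Big).
\end{align*}
```

With these identities, the inverse and determinant of such a covariance matrix reduce to scalar expressions, which is what makes a reparametrized, low-dimensional cost possible.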
Using the identities , and , the minimizing of (23) is found as (10). Inserting the variance estimate back, yields a concentrated cost function
[TABLE]
To find the minimizing , we first consider the stationary point of
[TABLE]
Taking the derivative with respect to , yields the following condition for a stationary point:
[TABLE]
or equivalently . Solving for , we obtain the estimate (11).
By evaluating the second derivative at this point, we verify that it is a minimum. Inserting (11) back into (24) and combining with (22), we can write (20) in the concentrated form (12) after omitting irrelevant constants.
- [1] G. Imbens and D. Rubin, Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press, 2015.
- [2] J. Pearl, M. Glymour, and N. Jewell, Causal Inference in Statistics: A Primer. Wiley, 2016.
- [3] C. Rao, Linear Statistical Inference and its Applications. Wiley Series in Probability and Statistics, Wiley, 1973.
- [4] B. L. Welch, “The significance of the difference between two means when the population variances are unequal,” Biometrika, vol. 29, no. 3/4, pp. 350–362, 1938.
- [5] S.-H. Kim and A. S. Cohen, “On the Behrens-Fisher problem: a review,” Journal of Educational and Behavioral Statistics, vol. 23, no. 4, pp. 356–377, 1998.
- [6] H. Van Trees, K. Bell, and Z. Tian, Detection Estimation and Modulation Theory, Part I: Detection, Estimation, and Filtering Theory. Wiley, 2013.
- [7] S. Kay, Fundamentals of Statistical Signal Processing: Detection Theory. PTR Prentice-Hall, 1993.
- [8] P. Stoica and P. Babu, “The Gaussian data assumption leads to the largest Cramér-Rao bound [lecture notes],” IEEE Signal Processing Magazine, vol. 28, no. 3, pp. 132–133, 2011.
