Coping with Selection Effects: A Primer on Regression with Truncated Data

Adam B. Mantz (KIPAC; Stanford)

arXiv:1901.10522·astro-ph.IM·October 14, 2020


TL;DR

This paper discusses statistical methods for performing regression analysis on truncated data, common in astronomy, highlighting how to account for selection effects and improve inference accuracy.

Contribution

It provides a general framework for regression with truncated data, including computational strategies and conditions where selection effects can be ignored.

Findings

01 Truncation affects the estimation of variables in regression models.

02 Modified models can account for undetected sources in incomplete data.

03 Recommendations for computational approaches to truncated data regression.

Abstract

The finite sensitivity of instruments or detection methods means that data sets in many areas of astronomy, for example cosmological or exoplanet surveys, are necessarily systematically incomplete. Such data sets, where the population being investigated is of unknown size and only partially represented in the data, are called "truncated" in the statistical literature. Truncation can be accounted for through a relatively straightforward modification to the model being fitted in many circumstances, provided that the model can be extended to describe the population of undetected sources. Here I examine the problem of regression using truncated data in general terms, and use a simple example to show the impact of selecting a subset of potential data on the dependent variable, on the independent variable, and on a second dependent variable that is correlated with the variable of interest.…


Tables

Table 1: Model parameters used to generate the mock data set in Section 4. See Equations 10–37.

Parameter	Value	Parameter	Value
$N$	$10^{4}$	$\sigma_{1}$	0.4
$\lambda$	2	$\sigma_{2}$	0.15
$\alpha_{1}$	0	$\rho$	0.5
$\alpha_{2}$	0	$s_{x}$	0.2
$\beta_{1}$	1	$s_{y_{1}}$	0.05
$\beta_{2}$	0.7	$s_{y_{2}}$	0.1

Equations

$$\mathcal{L}=\sum_{N=\hat{N}_{\mathrm{det}}}^{\infty}p(N)\,\binom{N}{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}, \tag{1}$$

$$\left\langle f_{\mathrm{mis}}\right\rangle=\int dx\,dy\,d\hat{x}\,d\hat{y}\;p(x|\Omega)\,p(y|x,\theta)\,p(\hat{x},\hat{y}|x,y)\left[1-P_{\mathrm{det}}(\hat{x},\hat{y}|\phi)\right], \tag{2}$$

$$\mathcal{L}_{\mathrm{det}}=\prod_{i=1}^{\hat{N}_{\mathrm{det}}}\int dx_{i}\,dy_{i}\;p(x_{i}|\Omega)\,p(y_{i}|x_{i},\theta)\,p(\hat{x}_{i},\hat{y}_{i}|x_{i},y_{i})\,P_{\mathrm{det}}(\hat{x}_{i},\hat{y}_{i}|\phi), \tag{3}$$

$$\mathcal{L}_{\mathrm{sim}}=\prod_{i=1}^{\hat{N}_{\mathrm{det}}}\int dx_{i}\,dy_{i}\;p(x_{i}|\Omega)\,p(y_{i}|x_{i},\theta)\,p(\hat{x}_{i},\hat{y}_{i}|x_{i},y_{i}). \tag{4}$$

$$\mathcal{L}=\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{det}}\right\rangle^{-\hat{N}_{\mathrm{det}}}, \tag{5}$$

$$\left\langle f_{\mathrm{det}}\right\rangle=1-\left\langle f_{\mathrm{mis}}\right\rangle \tag{6}$$

$$\begin{aligned}\mathcal{L}&=\sum_{N=\hat{N}_{\mathrm{det}}}^{\infty}\frac{e^{-\left\langle N\right\rangle}\left\langle N\right\rangle^{N}}{N!}\,\binom{N}{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}\\&=\frac{e^{-\left\langle N\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}}{\hat{N}_{\mathrm{det}}!}\,\mathcal{L}_{\mathrm{det}}\sum_{N_{\mathrm{mis}}=0}^{\infty}\frac{\left\langle N_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}}{N_{\mathrm{mis}}!}\\&\propto e^{-\left\langle N_{\mathrm{det}}\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}},\end{aligned} \tag{7}$$

$$\begin{aligned}p\left(\left\langle N\right\rangle\right)\mathcal{L}&\propto\mathcal{L}_{\mathrm{det}}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}+\alpha_{0}-1}e^{-\left(\beta_{0}+\left\langle f_{\mathrm{det}}\right\rangle\right)\left\langle N\right\rangle},\\\int_{0}^{\infty}d\left\langle N\right\rangle\,p\left(\left\langle N\right\rangle\right)\mathcal{L}&\propto\mathcal{L}_{\mathrm{det}}\left(\beta_{0}+\left\langle f_{\mathrm{det}}\right\rangle\right)^{-\left(\hat{N}_{\mathrm{det}}+\alpha_{0}\right)},\end{aligned} \tag{8}$$

$$p(N)=\int d\left\langle N\right\rangle\,p\left(\left\langle N\right\rangle\right)p\left(N|\left\langle N\right\rangle\right)=\frac{\Gamma(N+\alpha_{0})}{N!\,\Gamma(\alpha_{0})}\left(\beta_{0}+1\right)^{-N}\left(\frac{\beta_{0}}{\beta_{0}+1}\right)^{\alpha_{0}}, \tag{9}$$

$$p(x|\lambda)=\lambda e^{-\lambda x}, \tag{10}$$

$$p\left[\left.\left(\begin{array}{c}y_{1}\\ y_{2}\end{array}\right)\right|x\right]=\mathcal{N}\left(y_{1}\left|\alpha_{1}+\beta_{1}x,\,\sigma_{1}^{2}\right.\right)\mathcal{N}\left[y_{2}\left|\alpha_{2}+\beta_{2}x+\rho\frac{\sigma_{2}}{\sigma_{1}}(y_{1}-\alpha_{1}-\beta_{1}x),\,\sigma_{2}^{2}(1-\rho^{2})\right.\right], \tag{11}$$

$$p\left[\left.\left(\begin{array}{c}\hat{x}\\ \hat{y}_{1}\\ \hat{y}_{2}\end{array}\right)\right|\left(\begin{array}{c}x\\ y_{1}\\ y_{2}\end{array}\right)\right]=\mathcal{N}\left(\hat{x}\left|x,\,s_{x}^{2}\right.\right)\mathcal{N}\left(\hat{y}_{1}\left|y_{1},\,s_{y_{1}}^{2}\right.\right)\mathcal{N}\left(\hat{y}_{2}\left|y_{2},\,s_{y_{2}}^{2}\right.\right). \tag{12}$$

$$\left\langle f_{\mathrm{det}}\right\rangle=\int_{0}^{\infty}dx\,\lambda e^{-\lambda x}\left[1-\Phi\left(\frac{y_{\mathrm{1,lim}}-\alpha_{1}-\beta_{1}x}{\sqrt{\sigma_{1}^{2}+s_{y_{1}}^{2}}}\right)\right], \tag{13}$$

$$\left\langle f_{\mathrm{det}}\right\rangle=\int_{0}^{\infty}dx\,\lambda e^{-\lambda x}\left[1-\Phi\left(\frac{x_{\mathrm{lim}}-x}{s_{x}}\right)\right]. \tag{14}$$


Full text

Coping with Selection Effects:

A Primer on Regression with Truncated Data

Adam B. Mantz$^{1,2}$

$^{1}$Kavli Institute for Particle Astrophysics and Cosmology, Stanford University, 452 Lomita Mall, Stanford, CA 94305, USA

$^{2}$Department of Physics, Stanford University, 382 Via Pueblo Mall, Stanford, CA 94305, USA

Corresponding author e-mail: [email protected]

(Submitted 24 July 2018, accepted 28 January 2019)

Abstract

The finite sensitivity of instruments or detection methods means that data sets in many areas of astronomy, for example cosmological or exoplanet surveys, are necessarily systematically incomplete. Such data sets, where the population being investigated is of unknown size and only partially represented in the data, are called “truncated” in the statistical literature. Truncation can be accounted for through a relatively straightforward modification to the model being fitted in many circumstances, provided that the model can be extended to describe the population of undetected sources. Here I examine the problem of regression using truncated data in general terms, and use a simple example to show the impact of selecting a subset of potential data on the dependent variable, on the independent variable, and on a second dependent variable that is correlated with the variable of interest. Special circumstances in which selection effects are ignorable are noted. I also comment on computational strategies for performing regression with truncated data, as an extension of methods that have become popular for the non-truncated case, and provide some general recommendations.

keywords:

methods: data analysis – methods: statistical

1 Introduction

As astronomers increasingly adopt Bayesian methods, and computational resources continue to improve, ubiquitous features of astronomical data and models that are difficult to address through classical statistical methods based on the generalized linear model are now routinely dealt with. Among these are measurement errors on the independent variables of a regression, correlation in the measurements of independent and dependent variables (hereafter called covariates and responses, respectively), and the presence of intrinsic scatter. In addition to general-purpose Markov Chain Monte Carlo tools, easy-to-use codes for specialized but reasonably generic problems have been provided to and socialized within the community. Of particular note is linmix_err (Kelly, 2007), which uses conjugate Gibbs sampling to efficiently fit a model consisting of a linear mean relation, Gaussian measurement and intrinsic scatters, and a Gaussian mixture prior distribution of the covariates. While these are strong modeling assumptions, many of them are also fairly common, irrespective of the fitting method employed. The same approach has been generalized to multivariate regression (lrgs; Mantz 2016) and applied using an off-the-shelf Gibbs sampling environment (Sereno et al., 2015).

A common and problematic feature of astronomical data that these specialized tools do not address is selection bias, resulting in truncation of the observed data set. This refers to the situation in which the data set available for analysis is not representative of the complete population that we wish to make inferences about, and furthermore that even the size of that complete population is not known. Note that this scenario is distinct from that of “censored” data, in which a subset of measurements are unavailable even though the size of the complete data set is known and fixed. Modifications of classical nonparametric estimators have been developed to address truncation (Efron & Petrosian 1992, 1999). In the Bayesian framework, the solution is to incorporate the selection mechanism into a generative model for the data; this necessitates modeling the full population, including undetected (but potentially detectable) sources. An unavoidable feature of inference on truncated data, which becomes explicit in the Bayesian formulation, is that we must have a model to describe the portion of the complete population that is not observed.

In astronomy, selection effects such as Eddington and Malmquist biases have been discussed at least since the eponymous works of Eddington (1913) and Malmquist (1922, 1925). Their importance has been recognized for cosmological surveys, especially in the context of the abundance and scaling relations of clusters of galaxies (e.g., recently, Pratt et al. 2009; Vikhlinin et al. 2009; Mantz et al. 2010a, b; Allen et al. 2011), and the distance-redshift relation of type Ia supernovae (March et al., 2018). Similar selection effects are clearly also present in, for example, exoplanet surveys (e.g. Youdin 2011; Gaidos & Mann 2013), and have been discussed in the context of quasar and gamma ray burst data sets (Efron & Petrosian, 1994; Petrosian et al., 2015). The discussion below thus has applicability in many areas. (Indeed, after this work was submitted and was in revision, Mandel et al. (2018) wrote about an application of a similar framework to gravitational-wave astrophysics, with an emphasis on recovering the distribution of covariates.)

The purpose of this work is twofold. First, I hope to provide an understandable overview of how truncation can be incorporated into the likelihood function in general, as well as more concretely for a few specific (and simple) selection mechanisms. This will include some discussion of the special circumstances in which selection effects are ignorable, i.e. when the likelihood need not be modified. The emphasis is on regression (that is, recovering the parameters of a linear relation), although the basic approach is more general. Second, in simple cases where selection is not ignorable, constraints on a toy model obtained using the correct likelihood will be contrasted with those from methods analogous to the codes mentioned above, which do not account for selection. This is not to impugn those codes particularly, but to emphasize that failing to account for selection in an analysis has the potential to seriously compromise the results. In the conclusions, I will comment briefly on computational strategies for performing the complete analysis. The code used to perform the fits to mock data in this work is available as an extension of the Python implementation of lrgs (https://github.com/abmantz/lrgs).

2 A Concrete Scenario

While many aspects of the model framework developed in the next section are general, it is helpful to have a specific problem in mind for illustration. Therefore, consider the closely related tasks of studying the cosmology and scaling relations of galaxy clusters. The key ingredients of the model, and the notation used in this work, are as follows.

• The population of clusters in the Universe is described theoretically by a mass function, i.e. their number density as a function of mass and redshift. The mass function is determined by cosmological parameters (e.g. Press & Schechter 1974), which will be collectively denoted $\Omega$. Mass and redshift can be thought of as the covariates of a regression (below), and so are denoted $x$. Thus, the mass function, apart from a normalization, can be thought of as the a priori probability for a cluster to have a given mass and redshift, $p(x|\Omega)$. The normalization can be parametrized by $N$, the size of the complete population of interest. The interpretation of $N$ will depend on exactly what range of $x$ is used to define the population under study; in practice, the only requirement is that all sources that could plausibly be detected are included in this definition.

• A given cluster generates various observable signals such as the mass, temperature, X-ray luminosity and Sunyaev-Zel’dovich signal of the intracluster gas; the number and optical/IR luminosity of its galaxies; and the gravitational lensing shear induced on background galaxies by the cluster’s mass. These depend on the cluster mass and may evolve, and so can be thought of as response variables of a regression, $y$. Note that $y$ refers to the true value of an observable quantity, not to the observed value, which is subject to measurement error. The average scaling of $y$ with $x$ is generally modeled as a power law (that is, a line if $x$ and $y$ actually refer to the logarithm of mass, etc.). In addition to this average behavior, there is an intrinsic scatter in the values of $y$ for a given $x$. The scaling relations are thus described by a distribution $p(y|x,\theta)$, where $\theta$ parametrizes both the average scaling relation(s) and the intrinsic scatter.

Here I have implicitly assumed that the parameters represented by $\Omega$ and $\theta$ are distinct. This need not always be true, but it is a reasonably common separation, in this case reflecting a distinction between cosmological and astrophysical models.

• Measured (or potentially measured) values of the properties of a cluster will be denoted $\hat{x}$ or $\hat{y}$; these are related to the true values by a sampling distribution, $p(\hat{x},\hat{y}|x,y)$. This notation includes the possibility of correlations in the measurement errors. For simplicity, I will assume that the sampling distribution as a function of $x$ and $y$ is known, so that no additional parameters need to appear explicitly. The measured data additionally include $\hat{N}_{\mathrm{det}}$, the number of clusters detected in the survey.

In the galaxy cluster case, $\hat{x}$ could include spectroscopic measurements of redshift. However, measurements of mass are less straightforward, and in general any measured proxy for the mass may have an intrinsic scatter which correlates at fixed true mass with one of the observables in $y$. In practice, it therefore makes sense to include such mass proxies (including mass estimated from gravitational lensing) as response variables, with theoretical priors constraining the attendant scaling relation parameters. Moreover, the same set of measurements need not be available for all detected clusters, with the exception of the survey measurement(s) used to detect them to begin with.

• The probability for a cluster to be detected (that is, included in the data set) as a function of measured (or potentially measured) properties is denoted $P_{\mathrm{det}}(\hat{x},\hat{y}|\phi)$. This may depend on additional parameters, $\phi$, such as a completeness or a flux limit.

In practice, the detection process is generally a deterministic function of the survey data, so $P_{\mathrm{det}}$ can be written as a function of only $\hat{x}$ , $\hat{y}$ and $\phi$ for an appropriate definition of $x$ and $y$ . In fact, it is frequently possible to express $P_{\mathrm{det}}$ as a step function. For example, consider the scenario in which detection requires a measured flux to exceed a position-dependent threshold (corresponding to non-uniform survey depth). With position on the sky included in $x$ and flux included in $y$ , $P_{\mathrm{det}}$ has the form of a step function, dependent on position and measured flux. However, there is no real benefit to expressing things this way, since a cluster’s position on the sky is typically both well determined (effectively without error) and not otherwise of interest. Hence, one might instead define $P_{\mathrm{det}}$ in terms of measured flux only, with $P_{\mathrm{det}}$ proportional to the fraction of the survey footprint where the threshold for detection is less than a given value. Conversely, a metric often used to characterize cluster surveys is the detection probability as a function of (true) mass. However, writing $P_{\mathrm{det}}$ this way requires a marginalization over the dependent variable(s) associated with the survey detection (luminosity in the above example) as well as the corresponding scaling relation parameters. Consequently, this is not the most natural way to express $P_{\mathrm{det}}$ in the likelihood developed in the next section.

To summarize this scenario, $N$ and $\Omega$ determine the number density of clusters as a function of redshift and mass ( $x$ ); one or more response variables ( $y$ ) for each cluster follow from its redshift, mass and the scaling relation parameters, $\theta$ ; potentially measured values $\hat{x}$ or $\hat{y}$ follow from $x$ , $y$ and the sampling distribution; and these measurements combined with the detection probability result in $\hat{N}_{\mathrm{det}}$ clusters, and their measured values, forming the available data set. This level of generality will be maintained in Section 3. In Section 4, we will specify the scenario even further in order to illustrate the impact of selection effects on a toy data set.

3 Theory

3.1 Likelihood

At its most abstract level, the problem at hand is that of modeling the properties of some population of sources in the Universe, when a fraction of that population is systematically missing from our data set. Using the notation introduced in Section 2, our model thus divides the $N$ sources in the complete population into $\hat{N}_{\mathrm{det}}$ that are represented in the data and $N_{\mathrm{mis}}=N-\hat{N}_{\mathrm{det}}$ that are missing from the data. If we assume that sources are independent from one another both in their occurrence and detection, the likelihood of the data, marginalized over $N$ , can be written

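$$\mathcal{L}=\sum_{N=\hat{N}_{\mathrm{det}}}^{\infty}p(N)\,\binom{N}{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}, \tag{1}$$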

with $p(N)$ the prior distribution for $N$ . Here $\left\langle f_{\mathrm{mis}}\right\rangle$ is the a priori probability that a given source is not detected,

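$$\left\langle f_{\mathrm{mis}}\right\rangle=\int dx\,dy\,d\hat{x}\,d\hat{y}\;p(x|\Omega)\,p(y|x,\theta)\,p(\hat{x},\hat{y}|x,y)\left[1-P_{\mathrm{det}}(\hat{x},\hat{y}|\phi)\right], \tag{2}$$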

which appears once for each of the $N_{\mathrm{mis}}$ undetected sources. This construction explicitly shows the integrations required to express a completeness function (or effective $P_{\mathrm{det}}$ ) in terms of true properties $x$ and/or $y$ . Note also that this framework requires the sampling distribution, $p(\hat{x},\hat{y}|x,y)$ , to be defined for a generic source, at least for those observables involved in detection. That is, we need a generative model for the measurement errors involved in the survey detection process, not just error bars for the detected sources estimated from the data. $\mathcal{L}_{\mathrm{det}}$ is the likelihood associated with detections,

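$$\mathcal{L}_{\mathrm{det}}=\prod_{i=1}^{\hat{N}_{\mathrm{det}}}\int dx_{i}\,dy_{i}\;p(x_{i}|\Omega)\,p(y_{i}|x_{i},\theta)\,p(\hat{x}_{i},\hat{y}_{i}|x_{i},y_{i})\,P_{\mathrm{det}}(\hat{x}_{i},\hat{y}_{i}|\phi), \tag{3}$$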

where the factorization into a product relies on our assumption that the sources occur independently. The integral in this expression differs from that in Equation 2 both in the substitution $1-P_{\mathrm{det}}\rightarrow P_{\mathrm{det}}$ and in that $\hat{x}_{i}$ and $\hat{y}_{i}$ are fixed by observation rather than being marginalized over. In the common case where $\phi$ is constant, $P_{\mathrm{det}}(\hat{x_{i}},\hat{y_{i}}|\phi)$ is also a constant for all observed sources, and this factor is equivalent to the simplified likelihood (applicable in the absence of selection effects),

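$$\mathcal{L}_{\mathrm{sim}}=\prod_{i=1}^{\hat{N}_{\mathrm{det}}}\int dx_{i}\,dy_{i}\;p(x_{i}|\Omega)\,p(y_{i}|x_{i},\theta)\,p(\hat{x}_{i},\hat{y}_{i}|x_{i},y_{i}). \tag{4}$$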

The combinatoric factor in Equation 1, ${N\choose\hat{N}_{\mathrm{det}}}=N!/(\hat{N}_{\mathrm{det}}!\,N_{\mathrm{mis}}!)$, appears because the sources are a priori exchangeable.
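The integral in Equation 2 rarely has a closed form, but because it is an expectation over the generative model it can be estimated by forward simulation: draw sources from $p(x|\Omega)$, $p(y|x,\theta)$ and the sampling distribution, and average $1-P_{\mathrm{det}}$. The following is a minimal Python sketch of that idea, not code from lrgs. The function name `estimate_f_mis` and the one-covariate, one-response toy model passed to it (exponential $p(x)$, Gaussian scatter and errors, a hard threshold on $\hat{y}$) are assumptions made purely for illustration.

```python
import random

def estimate_f_mis(draw_x, draw_y, draw_obs, p_det, n_draws=100_000, seed=1):
    """Monte Carlo estimate of <f_mis>: the average of 1 - P_det over
    sources simulated from the full generative model."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = draw_x(rng)                      # x ~ p(x | Omega)
        y = draw_y(rng, x)                   # y ~ p(y | x, theta)
        x_hat, y_hat = draw_obs(rng, x, y)   # (x_hat, y_hat) ~ p(. | x, y)
        total += 1.0 - p_det(x_hat, y_hat)
    return total / n_draws

# Hypothetical toy model, for illustration only.
f_mis = estimate_f_mis(
    draw_x=lambda rng: rng.expovariate(2.0),
    draw_y=lambda rng, x: rng.gauss(x, 0.4),
    draw_obs=lambda rng, x, y: (rng.gauss(x, 0.2), rng.gauss(y, 0.05)),
    p_det=lambda x_hat, y_hat: 1.0 if y_hat > 1.5 else 0.0,
)
f_det = 1.0 - f_mis  # a priori probability that a source is detected
```

Note that the simulation needs measurement errors for generic, possibly undetected sources, which is the requirement on the sampling distribution stated above.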

3.2 Prior Distributions for $N$

Since the complete population size, $N$ , has been introduced, we will need to assign it a prior distribution. For example, Gelman et al. (2004) note that if $p(N)\propto N^{-1}$ (uniform in $\log N$ ), the sum in Equation 1 can be done analytically, yielding

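$$\mathcal{L}=\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{det}}\right\rangle^{-\hat{N}_{\mathrm{det}}}, \tag{5}$$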

with

$$\left\langle f_{\mathrm{det}}\right\rangle=1-\left\langle f_{\mathrm{mis}}\right\rangle \tag{6}$$

being the a priori probability for a source to be detected. Note that $\left\langle f_{\mathrm{det}}\right\rangle$ has no dependence on the measured properties of the detected sources, despite the fact that it appears to the power $-\hat{N}_{\mathrm{det}}$ .
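The analytic sum behind this result can be checked numerically. The sketch below (plain Python; the function name and the small value of `n_det` are hypothetical choices) evaluates the sum over $N$ of $p(N){N\choose\hat{N}_{\mathrm{det}}}\left\langle f_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}$ with $p(N)=1/N$, truncated at a large $N$, and multiplies by $\left\langle f_{\mathrm{det}}\right\rangle^{\hat{N}_{\mathrm{det}}}$. If the sum scales as $\left\langle f_{\mathrm{det}}\right\rangle^{-\hat{N}_{\mathrm{det}}}$, the product is independent of $\left\langle f_{\mathrm{mis}}\right\rangle$.

```python
from math import comb

def marginal_over_n(n_det, f_mis, n_max=1000):
    """Truncated sum over N >= n_det of p(N) * C(N, n_det) * f_mis**(N - n_det),
    using the improper prior p(N) = 1/N."""
    return sum(comb(n, n_det) * f_mis ** (n - n_det) / n
               for n in range(n_det, n_max + 1))

n_det = 5  # hypothetical number of detections
for f_mis in (0.3, 0.5, 0.7):
    f_det = 1.0 - f_mis
    # Constant across f_mis if the sum is proportional to f_det**(-n_det).
    print(f_mis, marginal_over_n(n_det, f_mis) * f_det ** n_det)
```

The printed products agree with one another; analytically the constant is $1/\hat{N}_{\mathrm{det}}$, which is one of the factors independent of the model parameters that the likelihood can drop.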

While this identity makes the $N^{-1}$ prior convenient, a Poisson distribution (dependent on a mean hyperparameter, $\left\langle N\right\rangle$ ) is more appropriate in most astronomical scenarios. This is consistent with our earlier assumption of independently occurring sources. With such a prior, Equation 1 becomes

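$$\begin{aligned}\mathcal{L}&=\sum_{N=\hat{N}_{\mathrm{det}}}^{\infty}\frac{e^{-\left\langle N\right\rangle}\left\langle N\right\rangle^{N}}{N!}\,\binom{N}{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}}\left\langle f_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}\\&=\frac{e^{-\left\langle N\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}}{\hat{N}_{\mathrm{det}}!}\,\mathcal{L}_{\mathrm{det}}\sum_{N_{\mathrm{mis}}=0}^{\infty}\frac{\left\langle N_{\mathrm{mis}}\right\rangle^{N_{\mathrm{mis}}}}{N_{\mathrm{mis}}!}\\&\propto e^{-\left\langle N_{\mathrm{det}}\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}\,\mathcal{L}_{\mathrm{det}},\end{aligned} \tag{7}$$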

where the identities $\left\langle f_{\mathrm{det}}\right\rangle=\left\langle N_{\mathrm{det}}\right\rangle/\left\langle N\right\rangle$ and $\left\langle f_{\mathrm{mis}}\right\rangle=\left\langle N_{\mathrm{mis}}\right\rangle/\left\langle N\right\rangle$ have been used, and where the last line discards a constant factor of $1/\hat{N}_{\mathrm{det}}!$ . The same expression can be derived (perhaps more intuitively) without the need to explicitly model and marginalize over $N$ by considering the Poisson likelihood for sources in bins of the $\hat{x}$ and $\hat{y}$ observables, and taking the limit of infinitesimally small bins (see Mantz et al. 2010a).
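As a sanity check on the Poisson case, the sum over $N$ can be compared numerically with the closed form obtained by summing the exponential series, $e^{-\left\langle N\right\rangle\left\langle f_{\mathrm{det}}\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}/\hat{N}_{\mathrm{det}}!$ (the factor multiplying $\mathcal{L}_{\mathrm{det}}$). The sketch below is illustrative only; the function names and the test values are hypothetical.

```python
from math import exp, lgamma, log

def poisson_marginal(n_det, mean_n, f_mis, n_max=2000):
    """Truncated sum over N >= n_det of Poisson(N | mean_n) * C(N, n_det) * f_mis**N_mis."""
    total = 0.0
    for n in range(n_det, n_max + 1):
        n_mis = n - n_det
        log_prior = -mean_n + n * log(mean_n) - lgamma(n + 1)             # log Poisson pmf
        log_comb = lgamma(n + 1) - lgamma(n_det + 1) - lgamma(n_mis + 1)  # log C(N, n_det)
        total += exp(log_prior + log_comb + n_mis * log(f_mis))
    return total

def closed_form(n_det, mean_n, f_mis):
    """exp(-<N> f_det) * <N>**n_det / n_det!, from summing the exponential series."""
    f_det = 1.0 - f_mis
    return exp(-mean_n * f_det + n_det * log(mean_n) - lgamma(n_det + 1))

ratio = poisson_marginal(7, 100.0, 0.9) / closed_form(7, 100.0, 0.9)
```

Working with logarithms avoids overflow in the factorials; `ratio` should be unity up to rounding error.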

In practice, $\left\langle N\right\rangle$ may depend on further parameters of the astrophysical model (e.g. cosmological parameters in the galaxy cluster scenario of Section 2). However, for those cases where we lack a physically motivated prior for $\left\langle N\right\rangle$, it may be convenient to assign a gamma distribution prior, as this makes the marginalization over both $N$ and $\left\langle N\right\rangle$ analytic. To see this, note that Equation 7 has the form of a gamma distribution for $\left\langle N\right\rangle$, with shape $\hat{N}_{\mathrm{det}}+1$ and rate $\left\langle f_{\mathrm{det}}\right\rangle$. If we take a gamma prior on $\left\langle N\right\rangle$ with shape $\alpha_{0}$ and rate $\beta_{0}$, then

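$$\begin{aligned}p\left(\left\langle N\right\rangle\right)\mathcal{L}&\propto\mathcal{L}_{\mathrm{det}}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}+\alpha_{0}-1}e^{-\left(\beta_{0}+\left\langle f_{\mathrm{det}}\right\rangle\right)\left\langle N\right\rangle},\\\int_{0}^{\infty}d\left\langle N\right\rangle\,p\left(\left\langle N\right\rangle\right)\mathcal{L}&\propto\mathcal{L}_{\mathrm{det}}\left(\beta_{0}+\left\langle f_{\mathrm{det}}\right\rangle\right)^{-\left(\hat{N}_{\mathrm{det}}+\alpha_{0}\right)},\end{aligned} \tag{8}$$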

discarding constant factors that depend only on $\hat{N}_{\mathrm{det}}$ , $\alpha_{0}$ and $\beta_{0}$ . In the second line above, the gamma density function has integrated to unity, provided that $\alpha_{0}>-\hat{N}_{\mathrm{det}}$ and $\beta_{0}>-\left\langle f_{\mathrm{det}}\right\rangle$ .

Though not infinitely flexible, the gamma distribution provides a range of potentially useful priors. Taking $\beta_{0}\rightarrow 0$ , it describes power-law priors of the form $\left\langle N\right\rangle^{\alpha_{0}-1}$ . For both $\beta_{0}\rightarrow 0$ and $\alpha_{0}\rightarrow 0$ ( $\left\langle N\right\rangle^{-1}$ , or uniform in $\log\left\langle N\right\rangle$ ), we intuitively recover Equation 5, while $\beta_{0}\rightarrow 0$ and $\alpha_{0}=1/2$ is the Jeffreys prior for $\left\langle N\right\rangle$ , $p\left(\left\langle N\right\rangle\right)\propto\left\langle N\right\rangle^{-1/2}$ . When $\left\langle N\right\rangle$ is expected to be large, approximately Gaussian priors can be accommodated by setting $\alpha_{0}=\mu^{2}/\sigma^{2}$ and $\beta_{0}=\mu/\sigma^{2}$ , where $\mu$ and $\sigma$ are the desired mean and standard deviation. If desired, a posteriori samples of $\left\langle N\right\rangle$ can be generated from the gamma distribution in Equation 8. A posteriori samples of $N$ could then be generated from a Poisson distribution; alternatively, if $\left\langle N\right\rangle$ is not of interest, samples of $N$ can be drawn directly from its marginalized posterior distribution,

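$$p(N)=\int d\left\langle N\right\rangle\,p\left(\left\langle N\right\rangle\right)p\left(N|\left\langle N\right\rangle\right)=\frac{\Gamma(N+\alpha_{0})}{N!\,\Gamma(\alpha_{0})}\left(\beta_{0}+1\right)^{-N}\left(\frac{\beta_{0}}{\beta_{0}+1}\right)^{\alpha_{0}}, \tag{9}$$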

which is the negative binomial distribution with parameters $\alpha_{0}$ and $(\beta_{0}+1)^{-1}$ .
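The a posteriori sampling of $\left\langle N\right\rangle$ and $N$ described above can be sketched as follows. This is an illustration under stated assumptions rather than the lrgs implementation: $\left\langle f_{\mathrm{det}}\right\rangle$ is held fixed (in a real analysis it changes with the model parameters at every step of the chain), the conditional posterior of $\left\langle N\right\rangle$ is taken to be a gamma distribution with shape $\hat{N}_{\mathrm{det}}+\alpha_{0}$ and rate $\beta_{0}+\left\langle f_{\mathrm{det}}\right\rangle$ (the product of the gamma prior with the gamma form noted above), and the number of undetected sources is drawn from a Poisson distribution with mean $\left\langle N\right\rangle\left\langle f_{\mathrm{mis}}\right\rangle$. The function name and the input values are hypothetical.

```python
import numpy as np

def sample_population_size(n_det, f_det, alpha0=0.5, beta0=0.0, size=10_000, seed=0):
    """Draw (<N>, N) pairs conditional on a fixed detection probability f_det."""
    rng = np.random.default_rng(seed)
    # <N> | data: gamma with shape n_det + alpha0 and rate beta0 + f_det
    # (NumPy parametrizes the gamma distribution by scale = 1 / rate).
    mean_n = rng.gamma(shape=n_det + alpha0, scale=1.0 / (beta0 + f_det), size=size)
    # Undetected sources: Poisson with mean <N> * <f_mis>; N = N_det + N_mis.
    n_mis = rng.poisson(mean_n * (1.0 - f_det))
    return mean_n, n_det + n_mis

# Illustrative inputs only.
mean_n, n_total = sample_population_size(n_det=658, f_det=0.0658)
```

With $\beta_{0}=0$ the mean of the `mean_n` draws is approximately $(\hat{N}_{\mathrm{det}}+\alpha_{0})/\left\langle f_{\mathrm{det}}\right\rangle$, the detected count scaled up by the detection probability.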

3.3 Ignorability

One of the central questions for this work is under what circumstances selection effects due to truncation require us to use Equation 7, rather than one of the simpler likelihoods $\mathcal{L}_{\mathrm{det}}$ or $\mathcal{L}_{\mathrm{sim}}$. The latter is possible when the posterior for the parameters of interest can be written strictly in terms of the observed data. Gelman et al. (2004) refer to selection effects as ignorable in this case, and discuss the necessary conditions. To summarize, selection is ignorable if the following statements are both true:

1. The prior distribution for $\phi$ is independent of the prior distribution for all other parameters.

2. Selection does not depend on unobserved (or potentially unobserved) data.

The first condition we can assume without losing too much generality, but the second is generically violated in truncation problems. Our default expectation in these circumstances should thus be that the formalism above is necessary. We will see below that, in very special circumstances, selection effects are ignorable for the purposes of constraining the parameters of the regression, $\theta$ , though not necessarily $\Omega$ (assuming that the two are indeed separable). The extreme, and intuitive, example of this occurs when data are missing completely at random with respect to the measurements and parameters of interest; in that case, $\mathcal{L}_{\mathrm{det}}$ is naturally a sufficient likelihood.

Another way to put this is that using the likelihood $\mathcal{L}_{\mathrm{det}}$ alone is not the same as “not using information from the number of detections” – that would be most closely equivalent to marginalizing $N$ over an uninformative prior, as outlined above.

4 Simple Examples

4.1 Toy Data, Models, and Methods

To illustrate how this works in a more concrete way, we can consider a simplified version of the galaxy cluster survey case outlined in Section 2. Specifically, let $x$ represent the log-mass only (neglecting redshift), and take

$$p(x|\lambda)=\lambda e^{-\lambda x}, \tag{10}$$

with $\lambda=2$. This is an approximately appropriate distribution for the log-masses of galaxy clusters (e.g. Evrard et al. 2014), apart from the unphysical restriction $x\geq 0$. We will consider two response variables, $y=(y_{1},y_{2})$, with power-law slopes and a Gaussian intrinsic scatter covariance roughly appropriate for the log X-ray luminosity and log temperature of the intracluster gas, respectively (Allen et al., 2011; Giodini et al., 2013);

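$$p\left[\left.\left(\begin{array}{c}y_{1}\\ y_{2}\end{array}\right)\right|x\right]=\mathcal{N}\left(y_{1}\left|\alpha_{1}+\beta_{1}x,\,\sigma_{1}^{2}\right.\right)\mathcal{N}\left[y_{2}\left|\alpha_{2}+\beta_{2}x+\rho\frac{\sigma_{2}}{\sigma_{1}}(y_{1}-\alpha_{1}-\beta_{1}x),\,\sigma_{2}^{2}(1-\rho^{2})\right.\right], \tag{11}$$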

where $\mathcal{N}$ denotes the multivariate normal density function for a given mean and covariance matrix. In particular, the marginal intrinsic scatter in $y_{1}$ at fixed $x$ is relatively large, and its average scaling is relatively steep, compared with the scatter and power-law slope of $y_{2}$ , and the two scatters are moderately correlated. Measurement errors are assumed to be Gaussian, uncorrelated and identical for all sources, again with typical magnitudes, corresponding to respectably high signal-to-noise data;

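$$p\left[\left.\left(\begin{array}{c}\hat{x}\\ \hat{y}_{1}\\ \hat{y}_{2}\end{array}\right)\right|\left(\begin{array}{c}x\\ y_{1}\\ y_{2}\end{array}\right)\right]=\mathcal{N}\left(\hat{x}\left|x,\,s_{x}^{2}\right.\right)\mathcal{N}\left(\hat{y}_{1}\left|y_{1},\,s_{y_{1}}^{2}\right.\right)\mathcal{N}\left(\hat{y}_{2}\left|y_{2},\,s_{y_{2}}^{2}\right.\right). \tag{12}$$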

Specifically, measurement errors were $s_{x}=0.2$ (roughly the intrinsic scatter due to correlated structure in cluster mass estimates from weak gravitational lensing; Becker & Kravtsov 2011), $s_{y_{1}}=0.05$ and $s_{y_{2}}=0.1$. (The assignment of a simple measurement error for mass contravenes the advice in Section 2, but is adopted for simplicity here.) A complete (before truncation) mock data set of $10^{4}$ clusters was generated using these parameters, which are summarized in Table 1. To make explicit the link to the notation of Sections 2–3, we have $\Omega=\{\lambda\}$ and $\theta=\{\alpha_{1},\alpha_{2},\beta_{1},\beta_{2},\sigma_{1},\sigma_{2},\rho\}$.
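A mock data set of this kind can be generated in a few lines. The sketch below uses the Table 1 values and only the Python standard library; it draws $y_{2}$ from its conditional distribution given $y_{1}$, which reproduces the correlated bivariate Gaussian scatter. The function name and random seed are arbitrary assumptions, so this will not reproduce the particular realization analyzed in this work.

```python
import math
import random

# Parameter values from Table 1.
N, LAM = 10_000, 2.0
A1, A2, B1, B2 = 0.0, 0.0, 1.0, 0.7
S1, S2, RHO = 0.4, 0.15, 0.5
SX, SY1, SY2 = 0.2, 0.05, 0.1

def make_mock(seed=42):
    """Return a list of (x_hat, y1_hat, y2_hat) for the complete population."""
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        x = rng.expovariate(LAM)          # p(x | lambda) = lambda * exp(-lambda * x)
        y1 = rng.gauss(A1 + B1 * x, S1)   # marginal distribution of y1 at fixed x
        # Conditional distribution of y2 given y1, which encodes the correlation RHO.
        mu2 = A2 + B2 * x + RHO * (S2 / S1) * (y1 - A1 - B1 * x)
        y2 = rng.gauss(mu2, S2 * math.sqrt(1.0 - RHO ** 2))
        # Independent Gaussian measurement errors, identical for all sources.
        rows.append((rng.gauss(x, SX), rng.gauss(y1, SY1), rng.gauss(y2, SY2)))
    return rows

mock = make_mock()
```

A truncated data set is then simply the subset of `mock` that passes whatever selection is imposed on the measured values.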

The following subsections will apply a simple selection on either $\hat{y}_{1}$ or $\hat{x}$ and contrast constraints obtained using the complete likelihood of Equations 7–8 with those obtained using only $\mathcal{L}_{\mathrm{det}}$ (equivalently, $\mathcal{L}_{\mathrm{sim}}$; Equations 3–4). The constraints from $\mathcal{L}_{\mathrm{det}}$ were computed using a Python-language version of the lrgs code that was straightforwardly extended to use the exponential form of $p(x)$ in Equation 10 rather than the usual Gaussian mixture. Constraints from the full likelihood were found by alternating lrgs conjugate-Gibbs sampling of the parameters that do not appear in $\left\langle f_{\mathrm{det}}\right\rangle$ ($x_{i}$ and $y_{i}$) with Metropolis sampling (via the lmc code; https://github.com/abmantz/lmc) of the remaining parameters ($\lambda$, $\alpha_{i}$, $\beta_{i}$, $\sigma_{i}$ and $\rho$), a strategy implemented as a submodule of lrgs. I will therefore refer to the two methods as lrgs and lrgs.trunc, respectively.

Identical priors were applied to the parameters common to the two methods, specifically uniform priors for $\alpha_{i}$ , $\beta_{i}$ and $\rho$ ; the Jeffreys prior for $\sigma_{i}^{2}$ , $p(\sigma_{i}^{2})\propto\sigma_{i}^{-2}$ ; and a Gaussian prior for $\lambda$ , with mean 2 and standard deviation 0.05. For the lrgs.trunc method, I took an uninformative Gamma prior on $\left\langle N\right\rangle$ , with $\alpha_{0}=1/2$ and $\beta_{0}=0$ , and followed the procedure in Section 3.2 to marginalize over $\left\langle N\right\rangle$ analytically and generate samples a posteriori. These choices for $\lambda$ and $\left\langle N\right\rangle$ priors mirror the typical situation in the analysis of galaxy cluster surveys, where we have prior information on the shape of the mass function, but wish to either fit for or marginalize over its normalization.

4.2 Selection on the Survey Response Variable

When the scaling relation of interest is for the dependent variable on which selection is based, it is clear that the requirements for selection to be ignorable are not met (Section 3). Consider the simple detection requirement $\hat{y}_{1}>y_{\mathrm{1,lim}}$, i.e. $P_{\mathrm{det}}(\hat{y}_{1}|y_{\mathrm{1,lim}})=\Theta(\hat{y}_{1}-y_{\mathrm{1,lim}})$, with $\Theta$ the unit step function. Figure 1 illustrates this selection on the mock data set with $y_{\mathrm{1,lim}}=1.5$, for which 658 points are “detected” in this particular realization. This is a sufficiently large data set that the systematic error introduced by using an incorrect likelihood is significant compared with the width of the posterior. In the first case, consider a fit only involving $x$ and $y_{1}$, ignoring any information about $y_{2}$ (but see Section 4.4).

For the particular scenario described above, we have

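$$\left\langle f_{\mathrm{det}}\right\rangle=\int_{0}^{\infty}dx\,\lambda e^{-\lambda x}\left[1-\Phi\left(\frac{y_{\mathrm{1,lim}}-\alpha_{1}-\beta_{1}x}{\sqrt{\sigma_{1}^{2}+s_{y_{1}}^{2}}}\right)\right], \tag{13}$$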

where $\Phi$ is the standard normal cumulative distribution function. More generally, selection on $\hat{y}_{1}$ implies that $\left\langle f_{\mathrm{det}}\right\rangle$ will depend explicitly on the parameters governing the marginal scaling relations of $y_{1}$ ; hence, the correct posterior for these parameters cannot be recovered if terms in the likelihood involving $\left\langle f_{\mathrm{det}}\right\rangle$ are neglected. Schemes that employ only “bias corrections” of the sampling distribution, by setting $p(\hat{y}_{1}|y_{1})$ to zero below $\hat{y}_{1}=y_{\mathrm{1,lim}}$ and renormalizing it (e.g. Vikhlinin et al. 2009; Sereno et al. 2015), do not address this feature. Note that both the methods considered here already implicitly include this information, since $P(\hat{y_{1}}<y_{\mathrm{1,lim}})=0$ for all detected sources.
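For this threshold selection, $\left\langle f_{\mathrm{det}}\right\rangle$ is a one-dimensional integral over $x$ of the exponential prior times the probability that $\hat{y}_{1}$ scatters above $y_{\mathrm{1,lim}}$, where the intrinsic scatter and measurement error add in quadrature. The sketch below evaluates it with a simple trapezoid rule using only the standard library; the function names, upper integration limit and step count are assumptions chosen for illustration, not part of lrgs.

```python
import math

def norm_sf(z):
    """Standard normal survival function, 1 - Phi(z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def f_det(y1_lim, lam, alpha1, beta1, sigma1, s_y1, x_max=20.0, n_steps=20_000):
    """Trapezoid-rule estimate of the integral over x >= 0 of
    lam * exp(-lam * x) * [1 - Phi((y1_lim - alpha1 - beta1 * x) / width)],
    with width = sqrt(sigma1**2 + s_y1**2)."""
    width = math.sqrt(sigma1 ** 2 + s_y1 ** 2)
    h = x_max / n_steps

    def integrand(x):
        return lam * math.exp(-lam * x) * norm_sf((y1_lim - alpha1 - beta1 * x) / width)

    total = 0.5 * (integrand(0.0) + integrand(x_max))
    total += sum(integrand(i * h) for i in range(1, n_steps))
    return total * h

# Table 1 values with the threshold y1_lim = 1.5.
value = f_det(1.5, lam=2.0, alpha1=0.0, beta1=1.0, sigma1=0.4, s_y1=0.05)
```

Because `value` depends on $\alpha_{1}$, $\beta_{1}$ and $\sigma_{1}$, it must be recomputed whenever those parameters change during sampling. For the Table 1 values it is roughly 7 per cent, the same order as the 658 detections out of $10^{4}$ sources quoted above.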

The constraints obtained from lrgs and lrgs.trunc are respectively shown as red and blue contours in the right panel of Figure 1; evidently, the former disagree with the input parameter values at high significance.

How can we intuitively understand this? First, it’s worth noting that prior information about the form of $p(x)$ has nothing to do with the bias in the constraints from lrgs. In fact, fixing $\lambda$ (which lrgs gets spectacularly wrong, in spite of the informative prior) to the true value does not significantly change the constraints on the scaling parameters. More formally, if we take the limit of zero measurement errors on $\hat{x}$ , the likelihood $\mathcal{L}_{\mathrm{sim}}$ (or $\mathcal{L}_{\mathrm{det}}$ ) provides no mechanism to produce covariance between $\lambda$ and the scaling parameters. We should therefore expect the bias produced by neglecting terms with $\left\langle f_{\mathrm{det}}\right\rangle$ in the likelihood to persist, even with perfect prior information on $p(x)$ , despite the fact that the true $p(x)$ clearly implies that a substantial number of objects must be missing at $x\lesssim 2$ .

The reason that $\mathcal{L}_{\mathrm{sim}}$ cannot recover the input scaling relation, even with accurate prior information about $p(x)$ , is that it fails to capture the systematic way in which sources with small $x$ are missing, specifically the dependence on their value of $\hat{y}_{1}$ . The practice of truncating the sampling distribution for $\hat{y}_{1}$ at $y_{\mathrm{1,lim}}$ is of no help here; while it may prevent the cluster of observed points just above $y_{\mathrm{1,lim}}$ at small $x$ from significantly penalizing models near the truth, models with shallower slopes that pass closer to these points will still be preferred. This is exactly what we see in the constraints from lrgs.

In contrast, the lrgs.trunc method, when provided with the same prior information, recovers the true parameter values. In this case, the highly biased points detected at small $x$ do carry significant information. Their particular values of $\hat{y}_{1}$ are not very informative for models near the truth – $y_{\mathrm{1,lim}}$ is so far from the mean scaling relation that detected points must lie just above the threshold. But the number of sources that exceed $y_{\mathrm{1,lim}}$ , combined with knowledge of $\left\langle dN/dx\right\rangle$ , constrains the scaling relation and its scatter. In this example, there are 100 detections with $x<1$ , implying (for $\left\langle N\right\rangle=10^{4}$ ) that $y_{\mathrm{1,lim}}$ exceeds the mean relation by $\sim 2$ – $3\,\sigma_{1}$ in this regime. This dependence of the interpretation of the data on $\left\langle N\right\rangle$ is illustrated by the degeneracies between $\left\langle N\right\rangle$ and the other parameters of interest in Figure 1.
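The arithmetic behind the $\sim 2$ to $3\,\sigma$ figure can be reproduced roughly as follows. This is a back-of-the-envelope check, not the calculation used in the text. It assumes $p(x)=\lambda e^{-\lambda x}$ on $x\geq 0$ with $\lambda=2$ (the prior mean quoted in Section 4.1), and it treats the offset between the threshold and the mean relation as a single number across $x<1$, measured in units of the total scatter of $\hat{y}_{1}$.

```python
from math import exp
from statistics import NormalDist

mean_N = 1.0e4   # <N>, as quoted in the text
lam = 2.0        # assumed slope of p(x); equals the prior mean in Section 4.1
n_det_low = 100  # detections with x < 1, as quoted in the text

n_low = mean_N * (1.0 - exp(-lam * 1.0))  # expected number of sources with x < 1
frac_det = n_det_low / n_low              # fraction of those that are detected
# One-sided Gaussian tail: how far above the mean relation the threshold must
# sit for only frac_det of the sources to scatter over it.
offset_sigma = NormalDist().inv_cdf(1.0 - frac_det)
print(f"{n_low:.0f} sources with x < 1, offset of about {offset_sigma:.1f} sigma")
```

Under these assumptions the offset falls within the quoted range. A different $\left\langle N\right\rangle$ would change `n_low` and hence the inferred offset, which is the degeneracy described above.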

4.3 Selection on the Covariate

At the other extreme, consider selection on $\hat{x}$ instead of $\hat{y}_{1}$ , $P_{\mathrm{det}}(\hat{x}|x_{\mathrm{lim}})=\Theta(\hat{x}-x_{\mathrm{lim}})$ , again ignoring $y_{2}$ for the moment. In this case, $\left\langle f_{\mathrm{det}}\right\rangle$ has an analogous form to Equation 38,

[TABLE]

Intuitively, the detected fraction is now independent of the scaling relation parameters, $\theta$ . It follows that:

• If our model for $\left\langle dN/dx\right\rangle$ is fixed a priori, then $e^{-\left\langle f_{\mathrm{det}}\right\rangle\left\langle N\right\rangle}\left\langle N\right\rangle^{\hat{N}_{\mathrm{det}}}$ is a constant and selection effects are ignorable. This holds regardless of whether there are non-zero measurement errors. If $\left\langle N\right\rangle$ is a free parameter, selection is still ignorable for inferences about $\theta$ because the likelihood factors into one part that depends on $\left\langle N\right\rangle$ and another that depends on $\theta$ , with no free parameters appearing in both.

• If measurement errors on $\hat{x}$ are zero (the latent parameters $x$ are effectively fixed), then selection is ignorable for inferences about $\theta$ (only). This is because the observed data are always complete and therefore unbiased for every $x$ that is represented in the data set.

The above are special cases, however. In general, when there are non-zero measurement errors on $\hat{x}$ and the model for $p(x)$ is not fixed, selection must be accounted for. Note that even these exceptions depend on the parameters governing $p(x)$ and the scaling relation ( $\Omega$ and $\theta$ ) being distinct, which was an assumption in this application (Section 2), but is not true of all possible applications.
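To make the independence from $\theta$ concrete, the sketch below evaluates $\left\langle f_{\mathrm{det}}\right\rangle$ for selection on $\hat{x}$. It assumes $p(x)=\lambda e^{-\lambda x}$ on $x\geq 0$ and a Gaussian measurement error on $\hat{x}$ (`sigma_x` is a hypothetical name). For that choice the integral of $p(x)\,\Phi[(x-x_{\mathrm{lim}})/\sigma_{x}]$ has a closed form, derived here for illustration rather than taken from the text.

```python
from math import exp
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def f_det_x(lam, x_lim, sigma_x):
    """<f_det> for detection when x_hat > x_lim.

    Assumes x ~ Exponential(lam) on x >= 0 and Gaussian error sigma_x on x_hat.
    No scaling-relation parameter (alpha, beta, sigma, rho) enters.
    """
    if sigma_x == 0.0:
        return exp(-lam * x_lim)  # error-free limit: P(x > x_lim)
    a = x_lim / sigma_x
    b = lam * sigma_x
    return PHI(-a) + exp(-lam * x_lim + 0.5 * b * b) * PHI(a - b)

exact = f_det_x(lam=2.0, x_lim=1.0, sigma_x=0.0)
noisy = f_det_x(lam=2.0, x_lim=1.0, sigma_x=0.1)  # sigma_x = 0.1 is hypothetical
print(f"f_det: {exact:.4f} (no errors), {noisy:.4f} (sigma_x = 0.1)")
```

Only $\lambda$, $x_{\mathrm{lim}}$ and the measurement error appear in the function, so the detected fraction couples to $\theta$ only through parameters shared with $p(x)$. The fraction still changes with `sigma_x` and `lam`, which is why selection is not ignorable once measurement errors are non-zero and $p(x)$ is free.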

The mock data selection for $x_{\mathrm{lim}}=1$ ( $\hat{N}_{\mathrm{det}}=1419$ ) and the resulting constraints appear in Figure 2. As one might guess from the discussion above, the constraints from lrgs are less biased than before, although the joint posterior for $\alpha_{1}$ and $\beta_{1}$ is still inconsistent with the input parameters at high significance. Again, fixing the value of $\lambda$ would not eliminate the bias on the other parameters (see above).

4.4 Selection on a Correlated Response Variable

Next, consider the case where we are interested in the scaling of $y_{2}$ with $x$ when the data set is selected on a different response variable, $\hat{y}_{1}$ . The key question here is whether $\hat{y}_{1}$ and $\hat{y}_{2}$ are correlated at fixed $x$ , either due to correlation of their measurement errors or due to an intrinsic covariance of $y_{1}$ and $y_{2}$ at fixed $x$ . If no such correlation is possible, then selection on $\hat{y}_{1}$ is equivalent to a (possibly noisy) selection on $x$ , and the comments in Section 4.3 apply.

For illustration, Equation 21 can be rewritten as

[TABLE]

This factorization demonstrates how our interpretation of $y_{2}$ may depend on information about $y_{1}$ , such as satisfaction of a selection criterion, for $\rho\neq 0$ . Specifically, the difference between $y_{1}$ and its mean value predicted by the scaling relation, $\alpha_{1}+\beta_{1}x$ , impacts our interpretation of the analogous displacement of $y_{2}$ from its mean scaling law. Thus, for a positive correlation, at low masses ( $x$ ) we expect a selection on luminosity ( $y_{1}$ ) to bias the observed data high in both luminosity and temperature ( $y_{2}$ ). The detected fraction for selection on $\hat{y}_{1}$ is given by Equation 38.
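The effect of $\rho$ can be made explicit with the standard conditional distribution of a bivariate Gaussian. The sketch assumes that, at fixed $x$, $y_{1}$ and $y_{2}$ are jointly Gaussian with marginal scatters $\sigma_{1}$ and $\sigma_{2}$ and correlation coefficient $\rho$. This matches the parameters named in the text, but the function and the numerical values (other than $\rho=0.5$) are hypothetical.

```python
from math import sqrt

def y2_given_y1(x, y1, alpha1, beta1, sigma1, alpha2, beta2, sigma2, rho):
    """Mean and standard deviation of y2 at fixed x and y1 (bivariate Gaussian)."""
    resid1 = y1 - (alpha1 + beta1 * x)  # displacement of y1 from its mean relation
    mean = alpha2 + beta2 * x + rho * (sigma2 / sigma1) * resid1
    sd = sigma2 * sqrt(1.0 - rho**2)
    return mean, sd

# A low-x source whose y1 lies 0.4 above its mean relation, as selection on
# y1_hat tends to require at small x.
m, s = y2_given_y1(x=0.2, y1=1.6, alpha1=1.0, beta1=1.0, sigma1=0.5,
                   alpha2=1.0, beta2=0.5, sigma2=0.2, rho=0.5)
print(f"E[y2 | x, y1] = {m:.3f}, sd = {s:.3f}")
```

For $\rho>0$, a source that sits above the $y_{1}$ relation is expected to sit above the $y_{2}$ relation by $\rho\,\sigma_{2}/\sigma_{1}$ times the $y_{1}$ displacement. For $\rho=0$ the conditional mean reduces to $\alpha_{2}+\beta_{2}x$.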

As in Section 4.3, we can see that selection effects are ignorable for inference of $\alpha_{2}$ , $\beta_{2}$ and $\sigma_{2}$ only in very special circumstances, namely if

1. the marginal scaling relation for $y_{1}$ (i.e. the values of $\alpha_{1}$ , $\beta_{1}$ and $\sigma_{1}$ ) and the intrinsic correlation coefficient, $\rho$ , are fixed a priori; and

2. the model for $p(x)$ is fixed or the measurement errors for $\hat{x}$ are zero.

Note that the second condition is identical to the requirement for selection on $\hat{x}$ to be ignorable. The requirements of the first condition above are exactly those that make selection on $\hat{y}_{1}$ equivalent to a noisy selection on $\hat{x}$ , with the nature of that stochasticity fully understood.

Using the same selection as in Section 4.2 ( $\hat{y}_{1}>y_{\mathrm{1,lim}}=1.5$ ), Figure 3 shows the complete and observed mock data set in terms of $\hat{x}$ and $\hat{y}_{2}$ . Due to the modest intrinsic correlation ( $\rho=0.5$ ) and relatively smaller marginal scatter $\sigma_{2}$ compared with $\sigma_{1}$ , selection effects on the observed data are less visually dramatic than in Figures 1–2. Nevertheless, we will see below that neglecting to model the truncation results in biased inferences.

Figure 4 compares constraints from lrgs and lrgs.trunc in the usual way, where both codes are now fitting joint scaling relations and scatter for $y_{1}$ and $y_{2}$ as a function of $x$ . In addition, constraints are shown from an lrgs analysis where $y_{1}$ is disregarded completely, i.e. simply fitting $y_{2}$ against $x$ without accounting for selection effects. In this particular case, both lrgs analyses are consistent with the input value of $\sigma_{2}$ , as one might guess by inspection of Figure 3, but produce biased constraints (to differing degrees) on $\alpha_{2}$ and $\beta_{2}$ .

Note that a non-zero correlation in the measurement errors of $y_{1}$ and $y_{2}$ would play essentially the same role as the intrinsic correlation, $\rho$ , in the discussion above.

5 Discussion and Conclusions

Although the examples explored above are far from exhaustive, hopefully it’s clear that selection effects have the potential to dramatically bias the results of otherwise straightforward model fitting if not taken into account. Exactly how important this systematic effect is compared with the statistical uncertainties is not a simple question to answer in general, as it will depend not just on the number of observed data points and their error bars, but also on the selection mechanism and the true, underlying model. One could always straightforwardly test whether simple fitting methods are able to recover the correct parameter values by running them on mock data appropriate for a given situation, along the lines of the examples above. A better option, whenever feasible, would be to properly include selection in the model being fit.
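A mock test of this kind can be very short. The sketch below is a deliberately simplified stand-in for the examples above: it uses no measurement errors, an assumed exponential $p(x)$, hypothetical parameter values, and ordinary least squares in place of lrgs as the "simple fitting method".

```python
import random

def ols(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(42)
alpha, beta, sigma, lam, y_lim = 1.0, 1.0, 0.5, 2.0, 1.5  # hypothetical inputs
x_all = [rng.expovariate(lam) for _ in range(10000)]
y_all = [alpha + beta * x + rng.gauss(0.0, sigma) for x in x_all]

# Truncate: keep only the sources above the detection threshold
detected = [(x, y) for x, y in zip(x_all, y_all) if y > y_lim]
x_det, y_det = zip(*detected)

a_full, b_full = ols(x_all, y_all)    # fit to the complete population
a_naive, b_naive = ols(x_det, y_det)  # fit that ignores the truncation
print(f"complete sample: slope {b_full:.2f}")
print(f"truncated sample, naive fit: slope {b_naive:.2f}, intercept {a_naive:.2f}")
```

The fit to the complete sample recovers the input slope, while the naive fit to the truncated sample prefers a shallower slope and higher intercept, in line with the behavior described in Section 4.2.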

An unfortunate feature of models that account for selection is that they lack the full conjugacy that allows all of the parameters in models like those used by lrgs and linmix_err to be efficiently Gibbs sampled, even for simple selection mechanisms like those considered here. Specifically, conjugacy will generally be lost for any parameters appearing in $\left\langle f_{\mathrm{det}}\right\rangle$ . It is, however, still possible to efficiently Gibbs sample the remaining parameters, in particular $x_{i}$ and $y_{i}$ , under the assumptions made by these codes, namely Gaussian (or similarly convenient) forms of the measurement errors, intrinsic scatter, and $p(x)$ . Since $x_{i}$ and $y_{i}$ normally account for the great majority of the free parameters in such models, mixing conjugate Gibbs sampling of $x_{i}$ and $y_{i}$ with some other method of sampling the remaining parameters, as outlined in Section 4.1, is a viable strategy for these cases (though I by no means claim it to be the most efficient strategy). Note that this strategy is not without its pitfalls; in particular, when using mixture models, the potentially large number of parameters and the exchangeability of the mixture components can make sampling challenging (this is a generic feature, not specific to truncation problems). In addition, it’s potentially helpful that $\left\langle N\right\rangle$ can be marginalized analytically for a wide range of approximately power-law and Gaussian priors.
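The mixed strategy can be sketched on a one-dimensional toy problem. This is an illustration under stated assumptions, not the lrgs.trunc algorithm. Latent values $y_{i}\sim N(\mu,\tau^{2})$ are measured with Gaussian error $s$ and detected when $\hat{y}_{i}>y_{\mathrm{lim}}$. $\tau$ and $s$ are treated as known, and $\left\langle N\right\rangle$ is marginalized analytically over a Gamma prior with $\alpha_{0}=1/2$, $\beta_{0}=0$, which leaves a factor $\left\langle f_{\mathrm{det}}\right\rangle^{-(\hat{N}_{\mathrm{det}}+\alpha_{0})}$. The latent $y_{i}$ keep their conjugate Gaussian update, while $\mu$ appears in $\left\langle f_{\mathrm{det}}\right\rangle$ and is updated with a Metropolis step. All names and values are hypothetical.

```python
import random
from math import log, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf
rng = random.Random(7)

mu_true, tau, s, y_lim, alpha0 = 1.0, 0.5, 0.2, 1.5, 0.5
# Mock population of 2000 sources; keep only those above the threshold
y_hat = [v for v in (rng.gauss(rng.gauss(mu_true, tau), s) for _ in range(2000))
         if v > y_lim]
n_det = len(y_hat)

w = (1 / s**2) / (1 / s**2 + 1 / tau**2)     # weight of the measurement
post_sd = sqrt(1 / (1 / s**2 + 1 / tau**2))  # conditional sd of each latent y_i

def log_cond_mu(mu, sum_y):
    """Log conditional density of mu given the latent y_i (up to a constant)."""
    f_det = PHI((mu - y_lim) / sqrt(tau**2 + s**2))
    if f_det <= 0.0:
        return float("-inf")  # guard against underflow far below the threshold
    gauss_term = -(n_det * mu**2 - 2 * mu * sum_y) / (2 * tau**2)
    return -(n_det + alpha0) * log(f_det) + gauss_term

naive_mu = sum(y_hat) / n_det  # mean of the detected values (biased high)
mu = naive_mu
chain = []
for _ in range(3000):
    # Gibbs step: conjugate Gaussian draw of each latent y_i given mu
    sum_y = sum(rng.gauss(w * yh + (1 - w) * mu, post_sd) for yh in y_hat)
    # Metropolis step: mu is not conjugate because it appears in f_det
    prop = mu + rng.gauss(0.0, 0.05)
    if log(1.0 - rng.random()) < log_cond_mu(prop, sum_y) - log_cond_mu(mu, sum_y):
        mu = prop
    chain.append(mu)

post_mean = sum(chain[1000:]) / len(chain[1000:])
print(f"naive mean {naive_mu:.2f}, posterior mean of mu {post_mean:.2f}")
```

The Gibbs update is unaffected by truncation here because the detection probability depends on $\hat{y}_{i}$ and not on $y_{i}$. Only the parameter entering $\left\langle f_{\mathrm{det}}\right\rangle$ needs the more general sampler.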

A basic and intuitive feature of truncation is that our interpretation of the data relies to some extent on a model for the population of sources that were not observed. There are two immediate consequences of this. Firstly, we can expect our results in general to be sensitive to prior information about $\left\langle dN/dx\right\rangle$ , including the form of $p(x)$ . Thus, the Gaussian mixture models employed by some “out of the box” codes, while convenient and flexible, are no substitute for accurate modeling of $p(x)$ . Inspection of the distributions of $\hat{x}$ selected from the mock data sets analyzed above (Figure 5) makes clear that no amount of flexible but uninformed modeling of the observed $\hat{x}$ data is likely to recover or even be consistent with the underlying, non-Gaussian form of $p(x)$ . While we may not need precise knowledge of the true $p(x)$ a priori to obtain correct results in this example, we would likely at least need the prior that $p(x)$ is monotonically decreasing. Secondly, the amount of data required for our results to be data-dominated rather than prior-dominated will generally be greater than in problems without truncation, and may not be particularly obvious. Analysis of mock data sets is probably the best way to get a handle on this.

Despite these complicating aspects, the general solution for fitting truncated data is relatively straightforward. This is encouraging, given that truncation is such a common feature of astrophysical data.

Acknowledgments

This work was supported by the National Aeronautics and Space Administration under Grant No. NNX15AE12G issued through the ROSES 2014 Astrophysics Data Analysis Program. I thank Gus Evrard and Arya Farahi for interesting discussions, and the anonymous referee for very good suggestions.

Bibliography

Allen S. W., Evrard A. E., Mantz A. B., 2011, ARA&A, 49, 409
Becker M. R., Kravtsov A. V., 2011, ApJ, 740, 25
Eddington A. S., 1913, MNRAS, 73, 359
Efron B., Petrosian V., 1992, ApJ, 399, 345
Efron B., Petrosian V., 1994, J. Am. Stat. Assoc., 89, 452
Efron B., Petrosian V., 1999, J. Am. Stat. Assoc., 94, 824
Evrard A. E., Arnault P., Huterer D., Farahi A., 2014, MNRAS, 441, 3562
Gaidos E., Mann A. W., 2013, ApJ, 762, 41
