Identifying Causal Genotype–Phenotype Relationships for Population‐Sampled Parent–Child Trios
Yushi Tang, Irineo Cabreros, John D. Storey

TL;DR
This paper introduces a new statistical test to identify causal genetic effects using parent-child trio data, leveraging genetic transmission as a natural randomization process.
Contribution
The paper introduces the transmission mean test (TMT) and clarifies the causal interpretation of the transmission disequilibrium test (TDT).
Findings
The TMT is suitable for arbitrarily distributed traits and random trio sampling.
The TDT is a causal test for dichotomous traits in affected-only designs.
Transmission-based methods offer distinct advantages over existing genotype-phenotype analysis approaches.
Abstract
The process by which genes are transmitted from parent to child provides a source of randomization preceding all other factors that may causally influence any particular child phenotype. Because of this, it is natural to consider genetic transmission as a source of experimental randomization. In this work, we show how parent–child trio data can be leveraged to identify causal genetic loci by modeling the randomization during genetic transmission. We develop a new test, the transmission mean test (TMT), together with its unbiased estimator of the average causal effect, and derive its causal properties within the potential outcomes framework. We also prove that the transmission disequilibrium test (TDT) is a test of causality as a complementary case of the TMT for the affected‐only design. The TMT and the TDT differ in the types of traits that they can handle and the study designs for…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6| 0 | 0 | 1 | 0 | 1 | 0 | |||
| 0 | 1 | 0 | 2 | 0 | 0 | |||
| 1 | 0 | 1 | 1 | 0 | 1 | |||
| 0 | 0 | 2 | 0 | 1 | 0 | |||
| 0 | 1 | 0 | 1 | 0 | 0 |
- —National Human Genome Research Institute10.13039/100000051
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenetic Associations and Epidemiology · Genetic Mapping and Diversity in Plants and Animals · Advanced Causal Inference Techniques
Introduction
1
The ultimate goal of understanding the genotype–phenotype relationship is to uncover genetic variants that are causal for the phenotype. The gold standard statistical framework for establishing causality is the potential outcomes framework (Neyman 1923; Rubin 1974; Imbens and Rubin 2015). Recently, there has been a growing interest in causal inference in genetics and genomics (Chen et al. 2007; Bates et al. 2020; Jiao et al. 2022; Y. Yang et al. 2022; Du et al. 2025). This paper investigates how human trio data (two parents and their child) can enable tests of causality based on comparisons of trait level potential outcomes corresponding to different alleles transmitted from the parents to the child. The motivation lies in the connection between the meiotic process and experimental randomization, which is intuitively compelling but has not been rigorously established in all scenarios. The biological process of meiosis is inherently random, ensuring that multiple children from a single pair of parents have independent variation in their inherited genetics. Since the variation induced by meiosis precedes other events that may influence a particular trait and it is randomized within a trio, meiosis resembles a form of experimental randomization. This was recognized early in the development of genetics, such as in Fisher's 1951 lecture (Fisher 1952):The different genotypes possible from the same mating have been beautifully randomized by the meiotic process. A more perfect control of conditions is scarcely possible, than that of different genotypes appearing in the same litter.
In a properly randomized experiment, it is a basic result that association implies causation (Rubin 1974; Holland 1986; Greenland 1990). If one accepts meiosis as a valid form of experimental randomization, then genotypes associated with a phenotype can be interpreted causally by leveraging the randomization of alleles transmitted during meiosis.
Fisher's observation that the meiotic process is a valid form of randomization is essentially correct within the single family or litter setting. However, when analyzing population‐sampled trios that consist of multiple families, the analogy between a randomized experiment and meiosis does not trivially extend to the collection of families analyzed together. For instance, in a sampled population, there may be confounding between genetic variation and a trait. One reason for this is population structure, commonly observed in human populations (Hao and Storey 2019). In a population, the processes that gave rise to genetic variation may be related to the processes that influence a trait.
Even though a phenotype cannot, in principle, change a genotype and even though genotypes are randomly sampled, this does not mean that population‐based studies in general contain the information needed to leverage the randomization due to meiosis. A related but distinct line of work, referred to as “Mendelian randomization” (MR), aims to identify nongenetic factors that are causal for phenotypes of interest (Smith and Ebrahim 2003; Lawlor et al. 2008; Qi and Chatterjee 2019; Howey et al. 2020; Zhu 2021; Zou et al. 2024). The challenges and underlying assumptions of MR have been variously discussed (Smith and Ebrahim 2004; Thomas and Conti 2004; Nitsch et al. 2006; Bochud et al. 2008; VanderWeele et al. 2014; Zhang 2023). MR constructs instrumental variables from genotypes for population‐sampled individuals but does not directly observe the meiotic process from parents to child. It is not a potential outcomes framework applied to trios where randomization is directly observed, as we consider here. A recent study developed an MR framework using trio data and a specific causal graphical model (Tudball et al. 2022). Within this framework, it remains challenging to perform causal inference without enumerating all biological factors that may influence causal effects estimation.
The current state of understanding is that other than in the single‐family setting, therefore, association implies neither causation nor proximity to causal loci, which is why association studies are not considered to be linkage or causality studies. Genome‐wide association studies (GWAS) determine statistical associations between genotypes and phenotypes without any general proof of causality. A framework has been proposed for causal inference where the parents and child have phased genomes available (Bates et al. 2020). This framework utilizes a definition of causality based on probabilistic independence (Pearl 2009), which is not necessarily equivalent to the potential outcomes we employ here (Rubin 1974; Imbens and Rubin 2015).
A model has been studied that identifies associations between the child phenotype and a linear model containing the child genotype and parent genotypes as additive terms (Kong et al. 2018; Young et al. 2022). This model focuses on a special class of traits where the parents' genotypes may have a strong environmental effect on the child's trait. The literature refers to the association parameter of the child's genotype as a “direct effect” and the association parameter(s) of the parents' genotypes as “indirect effects.” Even though in colloquial language “direct effect” implies causality, this model performs an association test and the term “direct” is used to distinguish the child genotype association from the parent genotypes associations. Although this model has different goals than ours, we nevertheless relate it to the potential outcomes framework.
Here, we develop a robust randomization‐based causal framework for genotype–phenotype relationships by considering two transmission‐based methods—one introduced here for arbitrarily distributed phenotypes in population‐sampled trios and the other for an existing method applied to dichotomous traits in a particular type of trio study. These methods do not require phased genomes, but rather standard diploid genotypes for parents and child. We name the first approach the transmission mean test (TMT) and refer to the second approach under its known name, the transmission disequilibrium test (TDT) (Terwilliger and Ott 1992; Spielman et al. 1993; Ewens and Spielman 1995). The TMT is a new set of methodologies developed here for both quantitative (continuous or counts) and binary phenotypes, when trios are randomly sampled from an arbitrarily structured population with other potential genotype–phenotype confounders. Such study designs occur, for example, when trios are sampled from a defined geographical region (Pediatric Cardiac Genomics Consortium and Writing Committee 2013; Bycroft et al. 2018; All of Us Research Program Genomics Investigators 2024) or from a particular population (Magnus et al. 2016), or when trios are sampled on the basis of parent‐level attributes (Newman et al. 2011). The TDT handles trios that have an “affected” child of a binary phenotype. Previous work described the study design for the TDT as affected‐only design, also known as affected family‐based controls, and have developed the TDT as a statistical test of association (Allison 1997; Abecasis et al. 2000; Laird et al. 2000; Ray et al. 2022). We show here that the TDT is a rigorous test of causality.
The main contributions of this work are to propose the TMT and to establish that the TMT and the TDT are valid causal inference methods within the potential outcomes framework. We start by assuming that the causal and noncausal variants are probabilistically independent. We define the concept of direct causality and develop the TMT to detect causal genetic effects for a broad class of traits. We prove that the TDT is a causal test for dichotomous traits in the affected‐only design. We then extend the causal framework of both the TMT and the TDT to the case where causal and noncausal loci may be genetically linked by developing the concept of causal linkage. In order to identify scenarios appropriate for these methods and how they can be used together with existing methods, we compare the proposed causal inference framework with established association methods, such as the linear mixed model for associations (LMM) (Kang et al. 2010; J. Yang et al. 2011), in the presence of population structure and confounding factors associated with both the trait and causal genotypes.
Theory and Methods
2
Data Structure and Method Overview
2.1
Trio Data
2.1.1
The data for an observed trio includes the child's measured phenotype as well as the genome‐wide genotypes of both parents and the child (Figure S1). The goal here is to identify the genetic loci that are causal for the measured phenotype among the children. Let J be the total number of trios. A sample of J trios will involve 3J distinct individuals. We assume that no individual is the parent of more than one child in the sample and no individual appears both as a parent and as a child in the sample.
Our framework starts from the unphased biallelic single‐nucleotide polymorphism (SNP) data, which is more common than phased data in practical settings. Let I be the total number of SNPs per individual. For a SNP, let a and b be the two alleles that generate three possible genotypes {aa,ab,bb}. We numerically encode these three genotypes by {0,1,2}. In the jth trio, j∈[1:J], let Gij∈{0,1,2} be the child's ith genotype, Zijm∈{0,1,2} the maternal ith genotype, and Zijp∈{0,1,2} the paternal ith genotype, i∈[1:I]. A complete set of trio genotype data includes the I×J matrix G={Gij} for the child and I×J matrices Zm={Zijm} and Zp={Zijp} for the parents.
We additionally denote Gij=Aijm+Aijp for each child, where Aijm∈{0,1} is the allele the jth child received from the maternal side and Aijp∈{0,1} the allele from the paternal side. We do not assume that the parental transmitted alleles Am={Aijm} and Ap={Aijp} are directly observed in our framework. In addition to the genotype data, we observe a J‐length vector of child's phenotypes Y={Yj} where Yj can be either quantitative (continuous or counts) or dichotomous. Let C be the index set of causal SNP(s) for a specific trait Y. The goal is to identify all i∈C such that there exists a causal effect from Gi to Y. As Gi=Aim+Aip and we are not distinguishing maternal and paternal genetic effects, an equivalent aim is to estimate the causal effect from (Aim,Aip) to Y.
Overview of the Proposed Framework
2.1.2
Here, we give a conceptual overview of the proposed TMT. The main idea is the characterization of the randomizations that have taken place, shown in Figure 1. Each child's trait value is potentially subject to two independent within‐trio randomizations, one from the maternally transmitted allele and one from the paternally transmitted allele. These transmitted alleles, Aim and Aip, can be either 0 or 1. In order for a meaningful randomization to take place, the parent must be heterozygous. For each heterozygous parent, we assign the child to the control group if receiving an allele 0 and to the treatment group if receiving an allele 1. The control and treatment group labels are arbitrary. It is the between group difference of trait values that we need in order to estimate causal effects. The key algorithm is the assignment procedure that constructs the control and the treatment groups. If a heterozygous parent in the jth trio transmits an allele 0 to the child, the corresponding Yj is included in the control. Conversely, if transmitting an allele 1, Yj is included in the treatment. It is possible for a trait value to go into both the control and treatment groups (wherein our proposed statistic cancels out the trait value). It is also possible for a trait value to be assigned twice to the control group if both heterozygous parents contribute 0 alleles and likewise for the treatment group, as shown in Figure 1.
Schematic of the TMT assignment procedure. Encode the two possible transmitted alleles at the ith locus as allele 0 and 1. Let Y be the value of the child's phenotype. In the jth trio, if a heterozygous parent transmits an allele 0 to the child, the corresponding phenotype Yj is included in the control group. If a heterozygous parent transmits an allele 1, the corresponding Yj is included in the treatment group. When there are two heterozygous parents, a trait can be included twice within one group or it can be assigned simultaneously to both groups. We do not observe a randomization from a homozygous parent.
Two vectors, Y0 and Y1, are generated, representing the observed trait values in the control and the treatment. Both Y0 and Y1 contain two types of elements: one attributed to the maternal side and the other attributed to the paternal side, written as:
where Y0=(Y0m,Y0p),Y1=(Y1m,Y1p).
Randomizations, when occurring in concordance with the required assumptions, allow one to straightforwardly infer causality from associations, even in the presence of confounders. This is well known in simple scenarios, such as a study with a single randomized variable that meets all required assumptions. However, the scenario we consider here with trios is nonstandard and presents some interesting challenges. Sampled trios can have zero, one, or two heterozygous parents. A trio with two heterozygous parents allows us to record two randomizations and it does not leave a homozygous parent acting as a potential confounder. However, the randomization of the two parents can be in opposite directions. In the case of one heterozygous parent and one homozygous parent, it should be noted that confounders with genotype (e.g., via population structure or confounding nongenetic variation) can still enter the child's trait value through the homozygous parent since there is no meaningful randomization from that parent. Trios with zero heterozygous parents are not used to infer causality as there are no meaningful randomizations.
We introduce a trait model and formulate it as general as possible to account for confounders. We then define a specialized potential outcomes model of causal effects, as it is nonstandard to randomly have zero, one, or two of the potential randomized variables manifesting as discernable randomizations. After these model formulations, we construct a causal effect estimand, together with a TMT statistic, based on the differences between the values in Y0 and Y1, and we prove several key operating characteristics of them.
Model Assumptions and Potential Outcomes
2.2
Here, we formulate a trait model and a causal effects model for each SNP and child combination. We therefore drop the SNP subscript i and individual subscript j until they are needed. We start by assuming no linkage between the SNP of interest and other causal SNPs.
The Trait Model
2.2.1
The child's trait Y is modeled as:
where ℐ(·) is the indicator function. It is the case that Am,Ap,G, and γ are all random variables whereas α0,α1, and α2 are fixed effects. Due to possible relatedness and population structure, the parental genotypes Zm and Zp may be arbitrarily dependent random variables, making Am and Ap dependent. (We show below that some independence results on Am and Ap hold when conditioning on certain parental genotypes Zm and Zp.) We assume that γ does not interfere with the transmitted allele from heterozygote parents, detailed precisely in Assumption 2. This model encompasses more general genotypic representations than a model with additive genetic effects.
One may propose to fit the trait model in Equation (1) by a simple linear regression between the child's trait Y and genotypes G. In order to obtain an unbiased estimate of α0,α1, and α2, such a regression approach would require exogeneity, that is, E[γ∣G]=0 or equivalently C(γ,G)=0. However, this is not necessarily satisfied according to the assumptions underlying Equation (1). For example, when there is population structure or dependence among genetic and nongenetic effects, then exogeneity is violated.
One assumption we make about the trait model is that the genetic effects are either nondecreasing or nonincreasing, but we do not assume an additive genetic model. As shown below, this allows us to tractably deal with the fact that there are two potential parental randomizations within each trio.
Assumption 1The conditional expectation of Y given parental transmitted alleles, E[Y∣Am,Ap], is either a nondecreasing function of Am+Ap such that
or E[Y∣Am,Ap] is analogously a nonincreasing function of Am+Ap.
Note that under the trait model in Equation (1), if E[Y∣Am,Ap] is nondecreasing, then α0≤α1≤α2; if E[Y∣Am,Ap] is nonincreasing, then α0≥α1≥α2. Since the alleles are arbitrarily coded as 0 or 1, without loss of generality we let the allele be 1 that yields an nondecrease in E[Y∣Am,Ap] and 0 otherwise. This is only for mathematical convenience and does not restrict our procedure or theory.
Potential Outcomes
2.2.2
To study the causal effect of the parental transmitted alleles (Am,Ap) on the child's phenotype Y in standard potential outcomes parlance, we model potential outcomes as variables representing the phenotype that the child would have developed if receiving a particular allele from the parental side of interest. Let A be the parental transmitted allele. The potential outcomes are Y(A=0) and Y(A=1), sometimes simplified as Y(1) and Y(0). Then the observed trait value Y can be written as a function of the potential outcomes:
Since the meiotic process for the transmission of the allele only happens once for each child, either Y(0) or Y(1) is observed, but not both. When considering a particular parental side, we write Ym(0) and Ym(1) for the maternal side and Yp(0) and Yp(1) for the paternal side. Under the trait model in Equation (1), the potential outcomes are:
It is important to note that our potential outcomes are defined with respect to the alleles inherited from the parent, reflecting the randomized Mendelian transmission process. This differs from a direct intervention on the child's genotype, such as might be feasible with gene‐editing technologies. While our framework captures the causal effects of alleles as transmitted through inheritance, direct manipulation of the child's genome may require a different causal definition and could lead to different methods.
The potential outcomes in Equation (2) are on a per parent basis, while the goal is to consider both parents simultaneously. We will next define causal effects on a per parent basis and then combine them for an overall joint parent transmitted allele causal effect—or equivalently causal effect of G on Y.
Causal Effects and Regression Effects
2.3
Average Causal Effect
2.3.1
A standard way to test for direct causality is to test for a nonzero average causal effect (ACE) which is a function of potential outcomes.
Definition 1 (ACE for an allele) The ACE of a parental transmitted allele A on Y is the average difference between potential outcomes Y(A=1) and Y(A=0):ACE(A→Y)=E[Y(A=1)]−E[Y(A=0)].
Under the trait model in Equation (1) and Assumption 1, notice that ACE(A→Y)=0 if and only if α0=α1=α2. Otherwise, ACE(A→Y)≠0. This allows us to define the ACE of the child's genotype on the trait in a manner that permits us to consider both parental transmitted alleles.
Definition 2 (ACE for genotype) Under the trait model and Assumption 1, we say ACE(G→Y)=0 if and only if α0=α1=α2. Otherwise, we say ACE(G→Y)≠0.
An alternative formulation for ACE(G→Y) would be to define ACE relative to a reference homozygote, the other homozygote, and the heterozygote. However, this would require introducing potential outcomes under simultaneous randomized transmissions of both parental alleles. For the heterozygote, such a formulation would involve outcomes of the form Y(Am=0,Ap=1) and Y(Am=1,Ap=0). In practice, it is technically difficult to distinguish between these two cases, as the causal effect cannot be traced separately by parental side of transmission. Our construction instead focuses on defining a nonzero ACE from the child genotype G to trait Y directly, which is sufficient to identify and test for causal effects in the model without complicating the notation or assumptions. For this reason, we adopt the more direct definition of a nonzero ACE for G in Definition 2, which suffices for the purposes of our test.
Since G=Am+Ap, we will use ACE(G→Y) and ACE((Am,Ap)→Y) interchangeably. We show formally that ACE(A→Y)≠0 if and only if ACE(G→Y)≠0 through the following lemma.
Lemma 1Under the trait model and Assumption 1,
We prove Lemma 1 in Appendix SA.1. By Lemma 1, a sufficient and necessary step to test G→Y is to test the sum of ACE(Am→Y) and ACE(Ap→Y) for equality to zero or not. This allows us to use trios with only one parental heterozygote or trios with two parental heterozygotes in a single estimator and test statistic.
Definition 3 (Direct causality)
- (A)The parental transmitted allele A is directly causal for the child's trait Y, denoted A→Y, if ACE(A→Y)≠0;
- (B)The child's genotype G is directly causal for Y, denoted G→Y, if ACE(G→Y)≠0.
Note that the definition of ACE above does not include there being a linkage between the target variant and other causal variants. We extend this definition to including linkage in Section 2.10.
Difference Between the Causal Effect and Regression Effect
2.3.2
Since the observed samples are drawn from the conditional distribution Y∣A instead of the marginal distribution of Y, what we observe in reality is a conditional expectation that we address by the average regression effect (ARE), ARE(Y∣A)=E[Y∣A=1]−E[Y∣A=0]. In general, ARE(Y∣A) does not equal ACE(A→Y), which is true for the trait model in Equation (1) where the AREs are:
while the ACEs are:
Scenarios where the ACE and ARE may be unequal include but are not limited to:
- (A)The dependence between A and Y is confounded by another variable such as population structure so that E[γ∣A=1]−E[γ∣A=0]≠0;
- (B)The parental transmitted alleles Am and Ap are not independent, which is common when population structure or relatedness exists.
Therefore, a direct regression between Y and Am+Ap is usually not sufficient to detect causal effects. In Appendix SA.14, we relate the potential outcomes model to a regression model that involves both child and parent genotypes as additive terms (Kong et al. 2018; Young et al. 2022).
Randomization via Heterozygous Parents
2.4
We develop an inference framework to estimate ACE rather than ARE. To this end, we are including the randomizations based on the parental genotypes Zm and Zp. Under Mendel's laws, a key fact is
which is the probability of the child to receive allele 1 instead of allele 0 given Z. Here Z=1 denotes a heterozygous parent who has an allele 1 and an allele 0, Z=0 a homozygous parent with two alleles 0, and Z=2 a homozygous parent with two alleles 1.
Notice that P(A=1∣Z=0)=0 and P(A=1∣Z=2)=1. Both cases will eliminate the possibility to observe a randomization and therefore cannot be included in estimating ACE. For the transmission from a heterozygous parent, Mendel's Law of Segregation implies that P(A=1∣Z=1)=1∕2, which indicates an equal probability to be assigned to either the control or the treatment. Therefore, both potential outcomes can be observed for trios with at least one heterozygous parent. This serves as the inclusion criteria for the TMT.
When violating Mendel's Law of Segregation, P(A=1∣Z=1) may be slightly shifted away from 1/2, which is known as “transmission distortion” (Zöllner et al. 2004). Even though various biological processes, such as biased segregation during meiosis and differential success of gametes in achieving fertilization, may lead to a skewed value of P(A=1∣Z=1), previous research reported a shifting value for human genome <0.005 genome‐wide and <0.0007 per locus (Zöllner et al. 2004). Therefore, we assume P(A=1∣Z=1)=1∕2 for the current work. If there's strong evidence for transmission distortion, one may use an estimate of P(A=1∣Z=1) to adjust the assignment probability in the TMT.
Now we state an essential assumption and lemma for estimating the ACE based on the randomization provided by alleles transmitted from heterozygous parents.
Assumption 2Under the trait model in Equation (1),
Assumption 2 follows Mendel's Law of Segregation in that each parent transmits one allele randomly to the child during meiosis, which proceeds the development of the child's trait. Then both P(A=a∣Z=1) and P(A=a∣Z=1,γ) equal 1/2. We use Assumption 2 to show Lemma 2, which is a classical assumption of causal inference under the potential outcomes framework.
Lemma 2Under the trait model in Equation (1), the potential outcomes of a child's trait, Y(0) and Y(1), are conditionally independent of the parental transmitted allele, A, given heterozygous parental genotype, Z=1, written as
For a specific parental side, (Ym(0),Ym(1))⊥Am|Zm=1 and (Yp(0),Yp(1))⊥Ap|Zp=1. We prove Lemma 2 in Appendix SA.2. In this proof, we first show the independence between the maternal and paternal transmitted alleles conditional on a heterozygous parent, that is, Ap⊥Am|Zm=1 and Am⊥Ap|Zp=1. This demonstrates how the randomization via heterozygous parents is independent from confounding factors such as population structure. Next we show that since the potential outcomes are functions of one parent's transmitted alleles and γ, Assumption 2 can be used to complete the proof for Lemma 2.
An equivalent statement of Lemma 2 is P(A=1∣Y(0),Y(1),Z=1)=P(A=1∣Z=1). Lemma 2 satisfies the unconfoundedness assumption in “intention‐to‐treat analysis” (Imbens and Rubin 2015). Intuitively, it implies that the allele transmission during meiosis is independent of the child's trait development later. Mathematically, Lemma 2 implies: (i) the equality in distributions of (Y(0)∣A=0,Z=1) and (Y(0)∣A=1,Z=1), and (ii) the equality in distributions of (Y(1)∣A=0,Z=1) and (Y(1)∣A=1,Z=1). Although these two pairs of equality in distributions are not testable since we cannot observe (Y(0)∣A=1,Z=1) and (Y(1)∣A=0,Z=1), they are necessary for inferring causal relationships. In practice, Lemma 2 is generally satisfied if the variation induced by meiosis precedes other events that may influence a particular trait, that is, the trait is developed after the allele transmission in meiosis.
TMT Parameter to Measure Causal Effects
2.5
We now define a parameter that captures the causal effect and we connect it to ACE(G→Y) through Theorem 1. Let N be the total number of heterozygous parents in that
Note that N≤2J. By conditioning on N, we derive an unbiased statistic in Section 2.6 for the following parameter.
Definition 4 (TMT parameter) Let δTMT be the TMT parameter such that:
Theorem 1Under the trait model in Equation (1), ACE(G→Y)≠0 if and only if δTMT≠0.
The trait model implies
so that
Under Assumption 1, either α0≤α1≤α2 or α0≥α1≥α2. Since 0<P(Ap=1∣Zm=1,N)<1 and 0<P(Am=1∣Zp=1,N)<1, then δTMT≠0 if and only if α0≠α1 or α1≠α2, which is equivalent to ACE(G→Y)≠0 by Definition 2.
Unbiased TMT Estimand
2.6
We now derive an estimator of δTMT and show it is unbiased.
Assignment Indicators
2.6.1
Let J‐length vectors W0={W0j} and W1={W1j} be assignment indicators for the control and the treatment,
Note that W0j,W1j∈{0,1,2}, and whenever W0j=W1j=0 this implies both parents are homozygous. We summarize all possible observations and corresponding assignments in Table 1 where no ambiguous assignment exists. These assignment indicators W0 and W1 are further utilized to construct the estimator of δTMT.
TMT Estimand
2.6.2
To estimate δTMT, we start from a preliminary statistic and prove its unbiasedness as follows.
Definition 5 (Preliminary statistic) Define the preliminary statistic dTMTnc as the sample mean difference between the treatment and the control, that is,
where the superscript “nc” stands for “not‐centered.”
Note that the factor of two results from the equal probability 1/2 that either parental allele is transmitted to the offspring under random segregation.
Lemma 3The statistic dTMTnc is unbiased given N in that
We prove Lemma 3 in Appendix SA.3. However, dTMTnc includes unnecessary variation in that the means of Yj are perturbed by the random variables W0j and W1j. One may then center each Yj by its overall mean and reduce the variance of dTMTnc by forming the centered statistic.
Definition 6 (Centered statistic) Let μ0,μ1 be the population means of potential outcomes Y(0),Y(1) conditioning on heterozygous parents such that:
and their average value
Define the centered statistic as
where the superscript “pc” stands for “population‐centered.”
When μc is unknown, we form an unbiased estimate μˆc, through which we further form the ultimate TMT statistic.
Definition 7 (TMT statistic) Define the population mean estimate μˆc such that
The ultimate TMT statistic used in practice involving no unobserved parameters, which we will denote by dTMT, is
Lemma 4The statistic μˆc is unbiased for μc in that
Also,
We prove Equation (4) in Appendix SA.4. Equations (5) and (6) are proved in Appendix SA.5 and are used to show that both centered statistics are unbiased.
Lemma 5Under the trait model in Equation (1), the statistics dTMT and dTMTpc are unbiased for δTMT in that
We prove Lemma 5 in Appendix SA.6. Calculating dTMT based on observed trio data enables the assessment of the existence of causality between the locus of interest and the target trait, which will be discussed in Section 2.8. Before that, we will need an estimate of the variance of dTMT in order to construct a test statistic.
Sampling Variance of the TMT Estimand
2.7
Deriving a sampling variance estimate and proving its operating characteristics is challenging in this setting. It has some properties that are different from a traditional randomized study. Note that when there is population structure or relatedness, the parental genotypes Zjm and Zjp may be dependent for each family j. Also, Zj and Zk may be dependent for any two families j≠k and for any combination of maternal and paternal genotypes. This means that unconditionally, the potential outcomes (Yjm(0),Yjm(1)) and (Yjp(0),Yjp(1)) may be dependent within each family j. This follows because these two pairs of potential outcomes are functions of the parental transmitted alleles Ajp and Ajm, respectively; they are therefore functions of the parental genotypes Zjp and Zjm, respectively. Likewise, (Yj(0),Yj(1)) and (Yk(0),Yk(1)) may be dependent between families j≠k for any combination of maternal and paternal genotypes.
We showed that Ajm⊥Ajp|Zjm=1 and Ajm⊥Ajp|Zjp=1 via Equation (SA2.c) from the proof of Lemma 2. Thus, within each family j, the potential outcomes are independent conditional on a heterozygous parent. In deriving the sampling variance here, we show several results that provide sufficient properties regarding between family covariances.
Another property of our setting that is different from a traditional randomized study is that the number of individuals assigned to treatment or control is random, whereas it would be predetermined in a traditional setting (Imbens and Rubin 2015). Further, a child can be assigned twice to treatment or control. Therefore, when deriving our sampling variance estimate, we first condition on these random events and then extend the estimate to take into account our particular randomization.
Sampling Variance Estimate
2.7.1
To develop a test based on dTMT, we first form a sampling variance estimate of the test statistic. Let J={1,2,…,J}. Define the following sets:
Notice that ∣T0∣+∣T1∣+2∣T00∣+2∣T11∣+2∣T01∣=N. Note that for j∈T01,(W1j−W0j)Yj=0 so these observations do not appear in dTMT. Denote means and variances of the other four sets as follows:
Definition 8 (Sampling variance estimate) We form the following estimates utilized in the ultimate sampling variance estimate.
From these, we construct the sampling variance estimate for dTMT as:
Unbiasedness of the Sampling Variance Estimate
2.7.2
We first discuss the expected value σˆTMT2 as an estimate for V(dTMTpc∣N) by Lemma 6 to Lemma 9. Then we show ∣V(dTMTpc∣N)−V(dTMT∣N)∣→0 with probability 1 as J→∞ in Lemma 10.
Lemma 6The estimate σˆTMT2 is unbiased for
Lemma 6 follows the fact that the sampling variance estimands σˆ02,σˆ12,σˆ002, and σˆ112 are unbiased estimators for their corresponding population variances V(Y∣T0),V(Y∣T1),V(2Y∣T00), and V(2Y∣T11) (Imbens and Rubin 2015).
We now show that Equation (8) is the variance of dTMTpc when δTMT=0 and it is a lower bound when δTMT≠0. To do so, we detail an assumption as follows.
Assumption 3Consider a sample of 2J parents including N heterozygous parents.
- (A) N∕(2J)→P(Z=1) with probability 1 when J→∞;
- (B)For j,k∈J and j≠k, P(Zj=a,Zk=b∣N)→P(Zj=a,Zk=b) with probability 1 when J→∞ for a,b∈{0,1,2};
- (C)For j,k∈J and j≠k,C(Zj,Zk)≥0.
Note that Assumption 3(A) is not an additional modeling assumption, but simply a formal statement that the empirical proportion of heterozygous parents (Z=1) converges almost surely to its probability as the sample size grows. Assumption 3(B) and Assumption 3(C) hold for any combination of maternal or paternal genotypes. By the Law of Total Probability, Assumption 3(B) implies that P(Z=a∣N)→P(Z=a) with probability 1 when J→∞ for a∈{0,1,2}. Also note that Assumption 3(C) is satisfied in prevalent population genetics models including the Identical‐By‐Descent (IBD) model, the co‐ancestry model, and so forth (Ochoa and Storey 2021).
Lemma 7Under Assumption 3, for j,k∈J and j≠k,
This holds for any combination of maternal or paternal genotypes.
We prove Lemma 7 in Appendix SA.7 and use it to prove the following lemma.
Lemma 8Write Dj=(W1j−W0j)(Yj−μc). The covariance between families
for j,k∈J and j≠k and when δTMT=0. Let ω=P(Z=1), which is the proportion of heterozygous parents. When δTMT≠0, as J→∞, the above covariance converges to a nonnegative value in that
Note that α0,α1 and α2 are parameters for the trait model in Equation (1). We prove Lemma 8 in Appendix SA.8 and use it to derive the following lemma.
Lemma 9Under the trait model in Equation (1) with arbitrary relatedness between families in a trio study:
- (A)When δTMT=0,σˆTMT2 is an unbiased estimator for V(dTMTpc∣N).
- (B)When δTMT≠0,σˆTMT2 underestimates V(dTMTpc∣N) as J→∞.
We prove Lemma 9 in Appendix SA.9. Lemma 9 justifies σˆTMT2 as an unbiased estimator for the variance of dTMTnc when δTMT=0 and a lower bound for the variance of dTMTnc when δTMT≠0. These properties lead us to derive a hypothesis test of δTMT=0 versus δTMT≠0 below. To further connect these results back to dTMT, we derive the following lemma.
Lemma 10 ∣V(dTMTpc∣N)−V(dTMT∣N)∣→0 with probability 1 as J→∞.
We prove Lemma 10 in Appendix SA.10.
Proposed Hypothesis Test
2.8
Here we derive a hypothesis test where the null hypothesis is H0:δTMT=0 and the alternative hypothesis is H1:δTMT≠0. Recall Theorem 1, where it was shown that δTMT≠0 if and only if ACE(G→Y)≠0, implying the hypothesis test is the desired test of causality. Under the trait model in Equation (1), the null hypothesis of no causality is true if and only if α0=α1=α2. Since dTMT is unbiased for δTMT in Section 2.6, we form a test statistic based on dTMT. The square root of the sampling variance estimate σˆTMT2 developed in Section 2.7 is also utilized as the standard error, which allows us to form a test statistic with a known null distribution.
We propose the following test statistic:
By the Central Limit Theorem, dTMT∕V(dTMT) is asymptotically Normal(0,1) when the null hypothesis, H0:δTMT=0, is true. When σˆTMT2≈V(dTMT), then τTMT is approximately Normal(0,1). The p value is calculated by
where X~Normal(0,1). Rather than using the Normal(0,1) distribution to calculate a p value, one could also use a permutation null distribution. Details for this permutation test are in Appendix SB.3. We will show later that the permutation null and the Normal(0,1) are similar in practice (Section 3.1).
TDT Is a Test of Causality
2.9
The TDT is applied when trios are recruited by identifying children with a particular disease. This sampling strategy is natural when studying a disease commonly contracted in childhood. As it is required to diagnose in advance whether a child has the disease or not, the TDT handles binary traits. Children who are sampled for TDT analysis have trait value Yj=1. Denoting the number of transmitted allele 0 by N0 and allele 1 by N1,
The above procedure is visualized in Figure 2. One then tests for a significant difference between N1 and N0. In the standard implementation (Spielman et al. 1993), McNemar's test (McNemar 1947) is performed based on a χ2 statistic
with p−value=P(X*≥XTDT2) where X* has a χ12 distribution.
Schematic of the TDT assignment procedure. If a heterozygous parent transmits an allele 0 to the affected child, an event (denoted by a solid dot) is recorded in the control. Alternatively, if transmitting an allele 1, an event is recorded in the treatment.
We show that the TDT is a test of causality via the following theorem, which relies on our causal inference framework for the TMT. We then show below that a component of the TDT statistic is an unbiased estimator of the causal effect analogous to dTMT.
Theorem 2The null hypothesis of TDT is true if and only if ACE(G→Y)=0.
We first construct a parameter analogous to the TMT parameter δTMT under the TDT setting by defining a TDT parameter δTDT, which involves a factor that accounts for the affected‐only sampling. Suppose the affected children are sampled from a population where the disease prevalence given parental heterozygote is
Note that η is not needed for the test but just for bridging the TDT and the TMT. Definition 9 (TDT parameter) Define the TDT parameter δTDT such that
A relevant difference between the TDT and the TMT for dichotomous traits is that the original TDT only samples trios with the child's trait value Yj=1, whereas the causal framework for the TMT is based on trios with the child's trait value Yj=1 and Yj=0 randomly sampled from the underlying population. To fit the TDT under the causal framework for the TMT, we derive causal properties of the TDT with respect to the underlying population with both Yj=1 and Yj=0. We explain the connection between δTDT and δTMT as follows. Lemma 11When the trait only takes two values, such that Y∈{0,1}, then δTDT=δTMT. We prove Lemma 11 in Appendix SA.11. It implies the equivalence between δTDT and δTMT for binary traits. It enables developing an estimand for δTDT in a similar format as the TMT estimand dTMT. To this end, we consider the following unbiased TDT estimand. Definition 10 (TDT estimand) The TDT estimand is defined as:
where Yj=1 for all j∈J. Note that dTDT=dTMT when Yj=1 for all j∈J. Lemma 12Under the trait model in Equation (1), dTDT is unbiased for δTDT in that
We prove Lemma 12 in Appendix SA.12.To connect dTDT to the original TDT test statistic, consider dTDT*=dTDT∕2=(N1−N0)∕N, which is the difference between the proportion of alleles 1 (N1∕N) and the proportion of alleles 0 (N0∕N) transmitted from affected child's heterozygous parents. The theory of McNemar's statistic (McNemar 1947) utilizes the following component
to derive a test statistic that equals χ2 with one degree of freedom as
Therefore, the original TDT is equivalent to testing H0:δTDT=0 versus H1:δTDT≠0. By Lemma 11, δTDT≠0 if and only if δTMT≠0. By Theorem 1, δTMT≠0 if and only if ACE(G→Y)≠0. Thus δTDT≠0 if and only if ACE(G→Y)≠0, which completes the proof for Theorem 2.
Detecting Causal Linkage via the TMT and TDT
2.10
Here, we consider the case where a genetic marker is linked to a causal variant, due to population‐level linkage disequilibrium (LD), within‐trio meiotic genetic linkage, or both. Suppose that marker c is causal while noncausal marker d is linked to c. Since c is causal the trait model from Equation (1) is Y=αc0ℐ(Gc=0)+αc1ℐ(Gc=1)+αc2ℐ(Gc=2)+γc, where the assumptions required for the TMT and TDT hold. A key assumption is Assumption 2, where it is assumed that P(Ac=a∣Zc=1)=P(Ac=a∣Zc=1,γc) for a∈{0,1}. For a causal locus, this is a plausible assumption as explained when that assumption was introduced. However, for locus d where Gc and Gd may be dependent, the assumption when modeling Y in terms of Gd may be violated.
We can write the trait model in terms of both genotypes Gc and Gd as:
where
Since c is causal, it follows that αc1−αc0≠0 or αc2−αc1≠0. Since d is not causal, it should be the case that αd0=αd1=αd2 = 0. In order to investigate the behavior of the TMT and TDT applied to marker d, we need to determine if P(Ad=a∣Zd=1)=P(Ad=a∣Zd=1,γd) for a∈{0,1}. (Note that when Zd≠1, then the transmission from this parent is not included in the TMT or TDT.) This may not be the case since Gc and Gd are dependent, so Ac and Ad may be dependent. Further, γd is a function of Ac (and Gc).
For a given parent, the dependence between Ad and γd is driven by Ac, so we will determine if P(Ad=a∣Zd=1)=P(Ad=a∣Zd=1,Ac) for a∈{0,1} by specifically calculating P(Ad=a,Ac=b∣Zd=1) for a,b∈{0,1}. When Zc=0, then Ac=0 with probability 1 and when Zc=2, then Ac=1 with probability 1. Thus,
for a∈{0,1}. The analogous result holds when Zc=2. One can conclude from this that when Zc=0 or Zc=2, then the required assumption to validly test marker d for causality holds for this parent–child combination.
This leaves the case of Zc=1. A given parent has two haplotypes from among the four possible haplotypes, {(0,0),(0,1),(1,0),(1,1)}. Let Hcd be the set of two haplotypes of the parent. Given that Zc=1,Zd=1, the possible haplotypes are Hcd={(0,0),(1,1)} or Hcd={(0,1),(1,0)}. Let
Also, let θcd be the meiotic recombination rate between loci c and d. It then follows that
Based on Equation (12), it is trivial to show that P(Ac=a∣Zc=1,Zd=1)=1∕2 for a∈{0,1} regardless of the values of ηcd and θcb. We have also shown earlier that P(A=a∣Z=1)=1∕2 for a∈{0,1} at any locus. It is therefore the case that
for a,b∈{0,1} whenever the quantities in Equation (12) equal 1/4, which holds if and only if either ηcd=1∕2 or θab=1∕2. Note that it does not need to be the case that ηcd=θab=1∕2; only one of the parameters needs to equal 1/2.
When markers c and d are in linkage equilibrium (LE), it will be the case that ηcd=1∕2, and when markers c and d are genetically unlinked (in the meiotic sense), it will be the case that θab=1∕2. When both of these do not hold, then the randomization at marker d is probabilistically dependent with the randomization at c and we say that these markers have randomization linkage. This implies that the null hypothesis of marker d is not true when marker c is causal. In this case we say that d is in causal linkage with Y, formalized in the following definition.
Definition 11 (Causal linkage of alleles at different loci) Suppose marker c is directly causal for Y, where Ac satisfies Definition 3 in that ACE(Ac→Y)≠0. Marker d at a different locus is in “causal linkage” with the trait Y if it has both LD and meiotic genetic linkage with marker c. We denote causal linkage by Ad↪Y.
If an allele A is itself directly causal for Y, then this allele A is trivially in causal linkage with Y since it is in complete linkage with itself. Thus, A→Y⇒A↪Y.
Definition 12 (Causal linkage of genotypes) For the child's genotype Gd=Adm+Adp, we say (Adm,Adp)↪Y if either Adm↪Y or Adp↪Y.
Lemma 13For the TMT, δTMT≠0 at marker d only if (Adm,Adp)↪Y. For the TDT, δTDT≠0 at marker d only if (Adm,Adp)↪Y.
We prove Lemma 13 in Appendix SA.13. It enables the identification of causal linkage by testing nonzero δTMT via the TMT and nonzero δTDT via the TDT. The test statistics dTMT and dTDT can be utilized to characterize the signal of causal linkage by genome‐wide TMT or TDT profiles shown in Section 3.3. Under simulations reflecting levels of LD observed in humans, it does not appear that causal linkage is an impediment to identifying causal loci. Under the theoretical calculations done above, it can be seen that the power is greater at c relative to d, and this is what we also observe empirically. It has been suggested that under Haldane's model of recombination one can break the genome into independent blocks, which would allow us to test for causality at the resolution of these blocks (Bates et al. 2020). However, empirical evidence suggests that Haldane's model of recombination does not hold in the human genome (Myers et al. 2005; Coop et al. 2008; Kivikoski et al. 2023).
Numerical Results
3
Evaluation of the TMT as a Test of Causality
3.1
Simulating Trio Genotypes
3.1.1
We simulated trio data to evaluate the accuracy and power of the TMT. We sampled the parental genotypes Z from a structured population based on a standard admixture model (Pritchard et al. 2000; Alexander et al. 2009; Cabreros and Storey 2019; Ochoa and Storey 2021) of four intermediate subpopulations (Appendix SB.1). We followed an established pipeline (Ochoa and Storey 2021) to simulate intermediate subpopulations while keeping the overall FST=0.2. (See Figure S2 for a visualization of pairwise relatedness in the structured population through co‐ancestry coefficients.) We then simulated the alleles (Aijm,Aijp) by randomly drawing one transmitted allele from corresponding parental genotypes (Zijm,Zijp), yielding the child's genotype Gij=Aijm+Aijp.
Simulating Child's Phenotypes
3.1.2
To simulate the child's trait Y={Yj} (see the schematic in Figure S3), we implemented a trait model with genetic effects from multiple loci in Equation (13). Let C be the index set of causal loci. The traits are generated according to:
where (ι,β) are scalar values, (α0,α1,α2) are genetic effects per locus (all causal loci share the same values of (α0,α1,α2)), and ϵj is the independent nongenetic random variation drawn from N(0,σe2). The random variable Ej is the nongenetic variation constructed to also be a function of the population structure. Therefore, Ej is a confounder between the genotypes and the trait, and the parameter β allows us to modulate the effect size of this confounder. Simulation details about Ej are summarized in Appendix SB.2, where we also show that Equation (13) satisfies the trait model in Equation (1). Note that the ratio (α2−α1)∕(α1−α0) can be any numerical value under the trait model in Equation (1). When (α2−α1)∕(α1−α0)=1, it is equivalent to an additive polygenic trait model. We display the result for (α2−α1)∕(α1−α0)=1 in Figure 3 and other (α2−α1)∕(α1−α0) values in Figure S4. We set ι=100 and note that any value of ι should work because ι would be subtracted from Yj after centering by μˆc in Equation (3).
The TMT is a valid test of causality. (A) dTMT is an unbiased estimator of δTMT. We followed Appendix SB.1 to generate 5000 trios from a structured population (FST=0.2) with 100,000 SNPs per individual and we randomly chose 100 loci to be causal. We simulated the child traits at the heritability levels (h2∈{0.3,0.6,0.9}) based on Appendix SB.2. Shown are 1000 randomly chose instances of dTMT−δTMT for each heritability level. (B) The TMT controls the FPR at the desired significance level per experiment. (C) The ROC curve from a randomly chosen simulated data set. Both the Normal(0,1) (solid lines) and permutation (dotted lines) null distributions are shown.
Point Estimation of ACE
3.1.3
We numerically confirmed the accuracy of dTMT as an unbiased estimate of the TMT parameter δTMT to detect nonzero ACE in Definition 1 across various values of h2 (Figure 3A). The unbiasedness of dTMT is immune to the confounding effects from the population structure and nongenetic factors that are associated with the population structure.
Causal FPR and the ROC Curve
3.1.4
We performed the TMT at each locus and calculated p values for both causal and noncausal SNPs by utilizing both the Normal(0,1) theoretical null distribution and the permutation null. We then computed the false positive rate (FPR) and the true positive rate (TPR) for causal loci at various significance thresholds between 0 and 1. The TMT strictly controls the FPR at the desired significance level across all levels of h2 (Figure 3B). The receiver operating characteristic (ROC) curve demonstrates the TMT having statistical power for detecting directly causal effects for all levels of h2, especially when h2 is high (Figure 3C). Besides the standard Normal(0,1) null distribution, we also used a permutation null distribution to calculate a p value by following Appendix SB.3. The two null distributions show similar performance (Figure 3B,C).
Validation of Theoretical Null Distribution
3.1.5
We started from a small sample of 100 trios and permuted Y to generate the null distributions of the test statistic τTMT from Equation (9). Our results validated that τTMT follows the Normal(0,1) distribution (Figure S5). Considering that population‐based trio studies usually contain at least a few hundred families, the Central Limit Theorem underlying the Normal(0,1) null distribution for τTMT seems to be applicable.
Evaluation of the TDT as a Test of Causality
3.2
As with the TMT evaluation, we followed the trio genotypes simulation pipeline in Appendix SB.1 to simulate 10,000 trios with 100,000 SNPs per individual. Among the child genotypes G={Gij}, we randomly chose 100 directly causal loci. To simulate a dichotomous trait, we first generated a continuous latent variable L={Lj} by
where the parameter configuration is the same as Equation (13) used for the TMT evaluation. We then used the probit model (see Appendix SB.4) to convert Lj into a dichotomous trait Yj according to a chosen disease prevalence. Children with Yj=0 are labeled unaffected and those with Yj=1 are labeled affected.
We applied the TDT on a per locus basis for trios with affected children and we also applied the TMT to a random subset of trios to match the sample size with the TDT. We computed the FPR and TPR across the range of significance levels. The TDT statistic dTDT also demonstrates it is an unbiased estimate of the ACE for binary traits (Figure 4A). At all significance levels, the TDT controls the FPR at the desired level (as does the TMT) across all values of h2 (Figure 4B). The ROC curves demonstrate similar and reliable statistical power of the TDT and the TMT for idenitifying causal loci (Figure 4C).
The TDT is a valid test of causality. (A) dTDT is an unbiased estimator of δTDT. We simulated 5000 affected‐only trios based on the description in the text. We then calculated dTDT and performed the TDT. Each box plot shows 1000 randomly chosen instances of dTDT−δTDT at each heritability level. (B) The TDT and TMT control FPR at the desired significance level. We additionally applied the TMT to 5000 randomly chosen trios composed of an equal number of unaffected and affected children. We calculated the FPR for the TDT and TMT across a range of thresholds as shown. (C) The ROC curve for a randomly chosen data set for both the TDT and TMT.
Genome‐Wide TMT Profile and Causal Linkage
3.3
To empirically validate the TMT in the presence of causal linkage, we used the software msprime (Baumdicker et al. 2022) to simulate trio genotypes with linkage between SNPs. We generated a sample of 5000 trios with 100,000 SNPs across 22 chromosomes per individual based on the American Admixture model (S. R. Browning et al. 2018) (simulation details in Appendix SB.5). The level of LD in our simulated genotypes (Figure S6) matches previous findings observed in the human genome (Reich et al. 2001; Dawson et al. 2002).
We first simulated a quantitative trait for the child phenotype to implement the TMT. Among the child genotypes G, we randomly chose 10 causal SNPs denoted by set C. We generated the child phenotypes Y by the polygenic trait model Yj=ι+∑i∈CbGij+ϵj where we drew ϵj from Normal(0,σe2) and we set σe2=1,ι=100. The coefficient for causal SNPs, b, is determined such that V(∑i∈CbGij)∕σe2=1 (i.e., the heritability h2=0.5). We displayed the TMT statistic dTMT and corresponding p values across all 100,000 SNPs as the genome‐wide TMT profile (Figure 5A). To compare with LE, we permuted trio genotypes to remove genetic linkage, regenerated the child trait using the same parameters in the polygenic trait model above, and conducted the TMT on a per locus basis to derive dTMT and p values (Figure 5B). Under both LD and LE scenarios, the TMT shows accurate ACE estimation at the causal loci and yields a Uniform(0,1) distribution of the p values when the null hypothesis of noncausality is true. We also simulated a dichotomous trait and calculated the TDT statistic dTDT and corresponding p values to construct the genome‐wide TDT profile under both LD and LE in Figure S7, where equivalent results were observed.
The genome‐wide TMT profile. (A) Randomization linkage scenario. (B) Independent randomization scenario, where the genotypes from (A) were independently permuted to remove the randomization linkage. For both scenarios, the TMT profile is presented as dTMT and −log10(p) at 100,000 SNPs across 22 chromosomes simulated by msprime. In the upper panel, the red dashed line is the true ACE at the causal loci and the black dashed line is for zero ACE at the noncausal loci. In the bottom panel, the gray dashed line is at p value = 5×10−8, which is commonly used as a p value threshold in GWAS.
TMT Versus Standard Approaches in the Presence of Confounding
3.4
Regression approaches, such as the ordinary least square (OLS) and LMM, identify genotypes that are associated with phenotypes while making no claims of causality. The OLS approach does not account for population structure, whereas the LMM approach does. In order to guarantee that association tests from these two approaches yield correct p value and FPR calculations, both approaches make a nontrivial assumption of exogeneity, which means that the nongenetic variation has zero covariance with the genetic variation. Thus, when confounding factors are present that conflate these two sources of variation, these methods may yield inaccurate results, even at the level of association.
To assess the behavior of the TMT, OLS, and LMM approaches in the presence of exogeneity, for each child j∈[1:J], we generated traits according to:
where C is the set of causal SNPs and κj is confounded with a subset of noncausal (and unassociated) SNPs. Full details are in Appendix SB.6. We implemented the TMT on a per marker basis to detect causal genotypes. We employed the LMM software GCTA (J. Yang et al. 2011) to detect associations. We also conducted simple linear regression between child trait and child genotypes on a per locus basis to test associations via OLS. In our simulation, the only SNPs that are associated with the trait are causal SNPs. Let C be the total number of causal SNPs and U the total number of noncausal confounded SNPs. Figure 6 displays the distribution of p values at noncausal SNPs where we induced confounding and 9U other random noncausal and unconfounded SNPs. Since these 10U SNPs are noncausal (and unassociated with the trait), their p values should be Uniform(0,1) distributed. In our simulations, only the TMT results in Uniform(0,1) distributed p values of noncausal SNPs. Both LMM and OLS report significantly small p values deviating from Uniform(0,1) for noncausal SNPs.
The distribution of −log10(p) at noncausal SNPs by the ordinary least square (OLS), the linear mixed model (LMM), and the TMT. The expected p values on the x‐axis are drawn from Uniform(0,1). Points are sorted by their quantiles.
We contrasted the TMT to family‐based association methods including the family‐based association test (FBAT) (Laird et al. 2000), the quantitative TDT (QTDT) (Abecasis et al. 2000) and the family‐based GWAS (FGWAS) (Young et al. 2022). It should be noted that these methods are designed to identify associations rather than establish causal effects, so the comparison presented here is for exploratory purposes only. We implemented an algorithm detailed in Appendix SB.1 to generate 3000 and 1000 trios with 100,000 SNPs per individual. At a randomly chosen causal locus, let G be the child genotype, Zm the maternal genotype, and Zp the paternal genotype. We simulated the child trait by Y=ι+bG+sκ+ϵ where b is the genetic effect size for the causal locus, s is the confounding effect size, and κ is the confounding effect correlated with parental genotypes by κ=(Zm+Zp−λ)⋅ℐ(Zm≠1,Zp≠1). Note that ℐ(⋅) is the indicator function; ι and λ are arbitrary scalars for which we set ι=100 and λ=10. We simulated ϵ~ Normal(0,1). For 3000 iterations, we generated child genotypes at the causal locus, simulated child traits, and applied the TMT, FBAT, QTDT and FGWAS to derive p values. This produced 3000 p values per method. We calculated empirically the type II error (β) as the proportion of p values greater or equal to a significance‐level α=0.05 among 3000 p values for each method, that is, β=(no. p values≥α)∕3000. Then we used 1−β to calculate the statistical power. In Figure S8, we plot power over a range of genetic effect sizes b. The left panel shows the scenario that the confounding effect size equals the genetic effect size (s=b) and the right panel shows a larger confounding effect size (s=2b). Both panels demonstrate that the TMT has higher power for detecting causal effects than FBAT, QTDT, and FGWAS under this simulation scenario.
Applying the TMT to Large‐Scale Association Studies
3.5
Large‐scale studies, such as the UK Biobank (Bycroft et al. 2018) and All of Us (All of Us Research Program Genomics Investigators 2024), involve hundreds of thousands of individuals. Many of the individuals in such large studies may be closely related. When studies contain closely related individuals (e.g., parent–child trios), researchers often retain only one of these individuals and exclude the others from the GWAS analysis. With the proposed TMT framework, one can instead set all closely related individuals aside and perform the GWAS on the remaining individuals. One can then follow up by validating the associations for causal effects on the related individuals using the TMT. Since the closely related individuals were not included in the initial GWAS analysis, one can even focus on a smaller set of the most significant SNPs from the GWAS analysis.
The UK Biobank has reported 107,162 related pairs (third degree or closer) (Bycroft et al. 2018) by examining kinship coefficients estimated with the software KING (Manichaikul et al. 2010) and discordant homozygote rates. Following the criteria from KING, we extracted individual pairs with kinship estimates around 1/4 (between 2−5∕2 and 2−3∕2) as first‐degree relatives but not monozygotic twins. We further filtered out individual pairs with a discordant homozygote rate > 0.1, which are likely to be siblings. We classified the remaining individual pairs as parent–offspring pairs and identified families with at least one child and two parents (one male and one female). We filtered families with parent–offspring pairs whose age difference is < 17. Through this procedure we identified 1064 trio candidates, similar to previous studies (B. L. Browning and S. R. Browning 2022). These sample sizes have reasonable power as shown in our simulations, especially on a well‐selected subset of SNPs identified in a traditional GWAS as causal candidates. We expect a comparable or larger sample size of parent–child trios for the All of Us data since it has reported a much larger size of closely related participants, specifically 215,107 pairs (second degree or closer) (All of Us Research Program Genomics Investigators 2024).
As evidenced by the thousands of citations of the original TDT publications (Terwilliger and Ott 1992; Spielman et al. 1993; Ewens and Spielman 1995), trio studies play an important role in genetics. In dbGaP, trio studies with reasonable sample sizes are available, presenting both whole‐genome sequencing data and phenotypes of interest. For example, the Pediatric Cardiac Genetics Consortium involves more than 5000 trios, together with a disease status related to congenital heart disease; it aims to recruit 10,000 trios with affected and unaffected children (Pediatric Cardiac Genomics Consortium and Writing Committee 2013). The Norwegian Mother and Child Cohort Study (MoBa) has collected around 44,000 genotyped trios (Magnus et al. 2016). The Finnish biobank FinnGen has sampled more than 12,000 genotyped trios (Kurki et al. 2023). The Estonian Biobank has sampled 14,063 genotyped trios (Milani et al. 2025). The total number of trios in these studies have similar sample sizes to our simulations. Therefore, in addition to the large‐scale studies containing many trios, the proposed methods should be useful to apply to existing and forthcoming studies aimed at directly sampling trios.
Discussion
4
In studies where genetic trios (two parents and a child) are sampled from a population, it is possible to measure the transmission of genetic variants from parent to child. We have shown that by leveraging the inherent randomization in this process, it is possible to construct a potential outcomes model and inference method to rigorously test for causality from genetic variant to trait in the child. We proposed the TMT in this scenario for all common trait types—quantitative, count, or dichotomous. We showed that within our framework, the well‐known TDT, where trios are sampled based on a dichotomous affected trait status in the child, is also a test of causality.
We demonstrated both theoretically and empirically that the TMT and TDT are robust to a wide range of confounders, including population structure and the confounding of nongenetic variation with the genetic signal. Our theoretical analysis showed how genetic relatedness and structure of parents within and between trios do not affect the accuracy of our tests. More generally, our theoretical results clarify why the analysis of the concordance of genetic transmission and traits within and among many pedigrees is robust to common confounders. We showed how LD and meiotic genetic linkage play roles in shared signal among neighboring SNPs.
The proposed framework is based on the potential outcomes model of causality in the context of randomized studies, which is the gold standard used in randomized controlled clinical trials required by the FDA. Mendelian randomization is a popular approach applied to the standard GWAS sampling strategy, where individuals from a population are sampled (as opposed to families). Mendelian randomization therefore does not observe the randomization process of genetic inheritance from parent to child. It is a method that implements instrumental variable regression using a genotype as the instrument to approximate exogeneity when regressing a trait on a biomarker, for example. A framework exists for the scenario where both parents and the child have phased genomes that utilizes a probabilistic independence definition of causality (Pearl 2009) and a computationally intensive simulation based null distribution (Bates et al. 2020), requiring Haldane's model of recombination (Lange 2002). A linear model of a child quantitative trait on the child and parent genotypes was related to our proposed framework.
Given that standard GWAS can have very large sample sizes, one may consider the relevance of trio‐based studies. We noted that many trio studies of a reasonable sample size are present in dbGaP. We also discussed that large studies, such as the UK Biobank (Bycroft et al. 2018) and All of Us (All of Us Research Program Genomics Investigators 2024), contain many instances of first‐degree relatives, including full parent–child trios. A possible strategy is to set aside the first‐degree relatives and perform a standard GWAS on the remaining individuals to identify causal candidates based on associations. One can then employ the TMT on the related individuals on a reduced set of candidate SNPs to identify causal variants. Going beyond trios to more general first‐degree relatives, such as siblings, future work could develop a framework where one can first predict missing parental genotypes by Mendelian imputation (Young et al. 2022) and then conduct the TMT based on this probabilistic imputation.
Supporting information
Supplementary Information
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Abecasis, G. R. , L. R. Cardon , and W. Cookson . 2000. “A General Test of Association for Quantitative Traits in Nuclear Families.” American Journal of Human Genetics 66, no. 1: 279–292.10631157 10.1086/302698 PMC 1288332 · doi ↗ · pubmed ↗
- 2Alexander, D. H. , J. Novembre , and K. Lange . 2009. “Fast Model‐Based Estimation of Ancestry in Unrelated Individuals.” Genome Research 19, no. 9: 1655–1664.19648217 10.1101/gr.094052.109PMC 2752134 · doi ↗ · pubmed ↗
- 3All of Us Research Program Genomics Investigators . 2024. “Genomic Data in the All of Us Research Program.” Nature 627, no. 8003: 340–346.38374255 10.1038/s 41586-023-06957-x PMC 10937371 · doi ↗ · pubmed ↗
- 4Allison, D. B. 1997. “Transmission‐Disequilibrium Tests for Quantitative Traits.” American Journal of Human Genetics 60, no. 3: 676–690.9042929 PMC 1712500 · pubmed ↗
- 5Bates, S. , M. Sesia , C. Sabatti , and E. Candès . 2020. “Causal Inference in Genetic Trio Studies.” Proceedings of the National Academy of Sciences 117, no. 39: 24117–24126.10.1073/pnas.2007743117 PMC 753365932948695 · doi ↗ · pubmed ↗
- 6Baumdicker, F. , G. Bisschop , D. Goldstein , et al. 2022. “Efficient Ancestry and Mutation Simulation With Msprime 1.0.” Genetics 220, no. 3: iyab 229.34897427 10.1093/genetics/iyab 229PMC 9176297 · doi ↗ · pubmed ↗
- 7Bochud, M. , A. Chiolero , R. C. Elston , and F. Paccaud . 2008. “A Cautionary Note on the Use of Mendelian Randomization to Infer Causation in Observational Epidemiology.” International Journal of Epidemiology 37, no. 2: 414–416.17881410 10.1093/ije/dym 186 · doi ↗ · pubmed ↗
- 8Browning, B. L. , and S. R. Browning . 2022. “Genotype Error Biases Trio‐Based Estimates of Haplotype Phase Accuracy.” American Journal of Human Genetics 109, no. 6: 1016–1025.35659928 10.1016/j.ajhg.2022.04.019PMC 9247820 · doi ↗ · pubmed ↗
