scTWAS: A powerful statistical framework for single-cell transcriptome-wide association studies

Zhaotong Lin; Chang Su

PMC · DOI:10.21203/rs.3.rs-6531106/v1·September 30, 2025

scTWAS: A powerful statistical framework for single-cell transcriptome-wide association studies

Zhaotong Lin, Chang Su

PDF

Open Access

TL;DR

This paper introduces scTWAS, a new method for analyzing gene expression in specific cell types using single-cell RNA data, improving the discovery of gene-disease associations.

Contribution

The novel contribution is a statistical framework, scTWAS, that enables cell-type-specific transcriptome-wide association studies using single-cell RNA sequencing data.

Findings

01

scTWAS improves prediction of genetically regulated gene expression across cell types in blood and brain tissues.

02

scTWAS identified more gene-trait associations in immune cell types compared to existing methods.

03

Application to Alzheimer’s disease revealed cell-subtype-specific associations, including MS4A6A and PPP1R37.

Abstract

Transcriptome-wide association studies (TWAS) have successfully identified genes associated with complex traits and diseases, but most rely on bulk transcriptome data, overlooking cell-type-specific contexts. Population-scale single-cell RNA sequencing data now enable such analyses, but present unique challenges due to strong noises, technical variations, and high sparsity. Here, we propose scTWAS, a statistical method to conduct cell-type-specific TWAS using single-cell data. Leveraging a latent-variable model and moment-based estimation to address the challenges of single-cell data, scTWAS consistently improves the prediction of genetically regulated gene expression across cell types in both blood and brain tissues. Compared to existing methods, scTWAS identified substantially more gene-trait associations across 29 hematological traits and three immune-related diseases in immune cell…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes2

MS4A6A PPP1R37

Proteins2

Diseases5

Alzheimer’s disease breast cancer Alzheimer's disease inflammatory immune-

Figures6

Click any figure to enlarge with its caption.

a](#F1) shows that scTWAS yielded a greater number of significant Stage 1 GReX prediction models compared to ZJ-TWAS and AG-TWAS (nominal *p*-value < 0.05). In [Figure 1b](#F1), scTWAS also exhibited higher power to detect gene-trait associations than the other two approaches, with the extent of the improvement varying by cell type abundance. Notably, the gain in power was more pronounced in less abundant cell types such as B_IN_ and Mono_C_ (0.23 and 0.30 higher than ZJ-TWAS, respectively). This highlights scTWAS’s ability to effectively leverage sparse data and enhance power in challenging s

a](#F2), scTWAS identified more significantly predicted genes across cell types, followed by ZJ-TWAS and AG-TWAS ([Figure 2b](#F2), Bonferroni corrected *p*-values < 0.05). To further demonstrate the cross-study prediction performance of scTWAS, we applied the prediction models trained on OneK1K to predict cell-type-specific bulk gene expression data from the DICE (database of immune cell expression, expression quantitative trait loci and epigenomics) project [[16](#R16)] for six immune cell types that match between two independent studies (Methods). [Figure 2b](#F2) shows that scTWAS generate

a](#F3) compares the prediction *p*-values in a fivefold cross-validation for microglia, demonstrating that scTWAS outperformed the other two methods across genes in generating more significant GReX models. To further validate prediction performance across studies, we applied the trained models to an independently collected snRNA-seq dataset on frontal cortex with 2.3 million nuclei from 427 individuals from the ROSMAP study, where 396 individuals have paired genotype data available [[35](#R35)]. As shown in [Figure 3b](#F3), scTWAS again achieved more significant prediction *p*-values compare

d](#F4) shows the estimated genotypes’ effects by scTWAS for *FLOT1* in Mono_NC_ with a prediction *p*-value of 4.3×10^−12^. In contrast, the prediction *p*-value is only 0.36 in CD4_NC_ ([Figure 4e](#F4)), even though it is the most abundant cell type, where stronger predictive performance would be expected if genetic regulation were present. This cell-type-specific signal in Mono_NC_ is further corroborated by the independent DICE study, where the expression quantitative trait loci (eQTLs) associated with *FLOT1* were found to be more significant in non-classical monocytes [[16](#R16)]. Thir

b](#F5)). For each of these cell (sub)types, we constructed pseudo-bulk count matrices similar to the procedure above and filtered out genes with less than 1,000 total counts. The number of samples, number of cells, and number of genes retained in the analysis for each cell (sub)type are provided in the Supplementary Table S2. The only exception is the analysis of microglia subtypes in [Figure 5](#F5), where we focused on the genes with at least 1,000 total counts in the microglia major cell type to expand the set of genes studied within the cellular context most relevant to AD mechanisms. For

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSingle-cell and spatial transcriptomics · Gene expression and cancer classification · Bioinformatics and Genomic Networks

Full text

Introduction

1

Genome-wide association studies (GWAS) have successfully identified numerous genetic variants associated with complex traits and diseases. However, the underlying biological mechanisms driving these associations remain unclear, as the majority of risk variants lie in non-coding regions and may exert their effects through gene regulation [1–3]. Transcriptome-wide association studies (TWAS) address this challenge by integrating GWAS data with gene expression to identify genes whose genetically regulated expression (GReX) is associated with the trait of interest [4–6]. It consists of two stages, which first trains a prediction model on the genetic regulation of gene expression with a reference panel of paired genotype and transcriptome data. In Stage 2, these prediction models are applied to GWAS data (either individual-level genotype and phenotype data or GWAS summary statistics) to test for associations between predicted gene expression and the trait of interest. Applications of TWAS have yielded numerous findings on the biological functions underlying disease pathogenesis and have further prioritized concrete targets for intervention and treatment [7–13].

Most TWAS to date use reference transcriptome data collected from disease-related tissues, such as peripheral blood mononuclear cells (PBMC) for immune diseases [13, 14] and brain tissues for Alzheimer’s disease (AD) [7, 8], and prioritize genes whose tissue-level expression is associated with complex diseases. However, tissues consist of biologically heterogeneous cell types, among which the genetic regulation of gene expression differ. For example, expression quantitative trait loci (eQTL) studies have demonstrated distinct genetic regulation across cell types in PBMC [15–17] and in brain [18]. Furthermore, specific cell types may contribute more significantly than others in disease pathogenesis. For example, heritability enrichment analysis shows that microglia carries a 7-fold higher genetic risk for AD compared to other brain cell types [19]. Beyond cell types, recent studies also highlight the importance of investigating cell subtypes to better understand disease mechanisms. For instance, a microglia subtype, known as disease-associated microglia, contributes uniquely to restricting neurodegeneration and has a higher expression of AD risk genes compared to other homeostatic microglia [20, 21] with potentially subtype-specific genetic regulation [18]. In contrast, most existing TWAS use reference gene expression data collected on bulk tissues, where mRNA molecules from heterogeneous cell types have been aggregated during data collection. As a result, they may fail to capture genetic regulation and disease associations specific to cell types and subtypes that play critical roles in diseases.

Conducting TWAS at the level of cell types is essential for elucidating the molecular and cellular mechanisms of diseases. While this could be performed with bulk data collected on sorted cell populations, the heterogeneity across cell subtypes will be obscured and the sample size may be insufficient for well-powered statistical analysis [16, 22]. The recently available population-scale single-cell RNA-sequencing data (scRNA-seq) provide an unprecedented opportunity to address these limitations. These data profile hundreds of individuals, each with thousands of cells per individual in disease-related tissues, identifying a dozen of major cell types and tens of refined cell subtypes [e.g. 17, 18, 23, 24]. This enables the modeling of genetic regulation and the identification of disease-associated genes across diverse cellular contexts.

However, the unique characteristics of scRNA-seq data, including strong noises, technical variations, and high sparsity, challenge existing statistical methodologies proposed for TWAS. Count data from scRNA-seq measure underlying gene expression levels in single cells with strong noises and technical variations from the sequencing experiments [25]. If left unaddressed, these measurement noises would attenuate the estimates of genetic effects, while technical variations across cells introduce artificial differences in mean expression and heteroskedastic variance, even among cells with the same underlying gene expression levels. The counts are also highly sparse due to the limited starting materials in single cells and the low capture efficiency of the technology [26]. In contrast, most existing TWAS methods were developed for bulk RNA-seq data and are therefore better suited to normalized and continuous expression traits with relatively homoskedastic variance [e.g. 4, 5]. This has motivated recent cell-type-specific TWAS analysis to apply heuristic normalization to single-cell data before applying existing TWAS methods [18, 27–30]. However, these normalization strategies may lack theoretical justification and fail to account for the unique characteristics of single-cell data [25, 26, 31], resulting in reduced accuracy in learning the genetic regulation of gene expression and diminished power to detect associations between GReX and diseases.

To address these challenges, we propose scTWAS, a statistical method for conducting cell-type-specific TWAS with scRNA-seq data. scTWAS introduces a latent-variable model that simultaneously models the effects of genetic regulation on the underlying gene expression and the measurement noises and technical variations introduced by sequencing experiments. By disentangling biological gene expression from technical noises, this model enables accurate quantification and estimation of GReX. We further develop a moment-based estimation framework that predicts GReX while explicitly addressing heteroskedastic variance across samples. These designs lead to improved accuracy of GReX prediction and higher statistical power of identifying TWAS genes in cellular contexts of interests with scRNA-seq data.

We applied scTWAS on multiple population-scale scRNA-seq datasets to construct GReX prediction models and test the association between GReX and traits in relevant cellular contexts. Our results demonstrate that scTWAS consistently outperformed existing approaches in GReX prediction across cell types and independent datasets. The improvement is particularly pronounced in settings with fewer cells, where GReX is more difficult to predict. Using GReX models for immune cell types, scTWAS systematically identified more associations with hematological traits and uncovered cell-type-specific regulatory mechanisms underlying genes linked to immune-related diseases, providing new findings that have not been previously reported by GWAS or TWAS with bulk gene expression data. Using GReX models for brain cell types, scTWAS revealed the cellular contexts in which both established and previously unreported genes may contribute to AD pathogenesis. Notably, it uncovered regulatory mechanisms of MS4A6A specific to a subtype of disease-associated microglia that are not shared with surveilling or reactive microglia, underscoring the importance of refined cellular resolution and single-cell-based TWAS for understanding the functional impact of disease-associated variants. The GReX prediction models for immune and brain cell types will be made publicly available upon acceptance (Data availability).

Results

2

Method overview

2.1

scTWAS consists of two stages: Stage 1 constructs cell-type-specific GReX prediction models for each gene using the genotype data and single-cell gene counts from a specific cell type; and Stage 2 tests for the association of predicted GReX and a trait of interest using the trained Stage 1 models and GWAS summary statistics or individual-level data.

In Stage 1, we model pseudo-bulk counts, defined as the total gene counts across all single cells per cell type and individual, in order to mitigate the high sparsity in single cells and to represent individual-level gene expression. For a given gene-cell type pair, let $[eqn]$ denote the underlying expression level, defined to be the abundance of mRNA molecules from this gene-cell type pair in individual $[eqn]$ . Let $[eqn]$ be a vector of cis-genotype (cis-single nucleotide polymorphisms, SNPs) in individual $[eqn]$ and $[eqn]$ be a vector of covariates (e.g. intercept, sex, age, genotype principal components). We assume that

[eqn]

where $[eqn]$ denotes a nonnegative distribution with mean $[eqn]$ and variance $[eqn]$ , and $[eqn]$ and $[eqn]$ denote the coefficients for genotypes and covariates, respectively. This assumes that the underlying gene expression is a random variable whose expectation depends on genotypes and covariates, and whose variance reflects the biological variations across individuals. It makes no parametric assumption on the distribution of $[eqn]$ , accommodating common assumptions such as Gamma distribution [26, 32]. We call this the expression model, which models the genetic regulation of gene expression in the cell type. In practice, $[eqn]$ is a latent variable to be measured by sequencing experiments. To model the measurement process, we use $[eqn]$ to denote the observed pseudo-bulk counts from individual $[eqn]$ and $[eqn]$ to denote the sequencing depth for individual $[eqn]$ , defined as the total counts across all genes from individual $[eqn]$ in the cell type of interest. We propose the following measurement model:

[eqn]

This assumes that the observed gene counts are measured with Poisson sampling noises [25, 26], whose magnitude further depends on sequencing depths, a technical factor known to substantially vary across samples [26]. Together, (1)–(2) give the expression-measurement model [25, 31], which accurately quantifies the genetic regulation of underlying gene expression with $[eqn]$ while accounting for noises and technical variations in measured gene counts. To build GReX prediction models (i.e., $[eqn]$ ), we propose a penalized method-of-moments approach to estimate $[eqn]$ , where an iteratively re-weighted least squares (IRLS) method is employed to optimize model parameters and to improve statistical efficiency. Details are presented in Section 4.1.

In Stage 2, we test associations between predicted GReX and the trait of interest in the cell type of interest. When individual-level GWAS data (both genotype and phenotype) are available, cell-type-level gene expression can be directly predicted using SNP coefficients estimated from Stage 1 (i.e., $[eqn]$ ), and its association with the phenotype can be tested using regression methods [4]. In most cases where individual-level data are not available, we can perform the test using GWAS summary statistics and an LD reference panel [5].

In summary, scTWAS integrates single-cell data with genetic information to identify cell-type-specific gene-trait associations. By leveraging an expression-measurement model and a moment-based estimation framework, scTWAS effectively accounts for noises and variations from the measurement process and models the genetic regulation of underlying expression levels, eliminating the need for heuristic normalization. It provides more accurate GReX predictions, which then improves the statistical power for association testing with complex traits. This method enhances our ability to uncover biologically meaningful insights at the cell-type level, advancing our understanding of molecular and cellular mechanisms of complex diseases.

scTWAS achieves higher prediction accuracy and power on simulated data

2.2

To evaluate the Stage 1 prediction performance and Stage 2 power of scTWAS, we simulated realistic single-cell gene expression data based on the characteristics of OneK1K study [17], which is one of the first population-scale scRNA-seq datasets, comprising 1.27 million cells from 14 immune cell types across 982 individuals. We evaluated the performance of scTWAS against two representative approaches used in recent cell-type-specific TWAS analyses with scRNA-seq data: AG-TWAS [29] and ZJ-TWAS [28]. Both approaches used an elastic net model for Stage 1 prediction, but applied different normalization steps to single-cell data. Briefly, AG-TWAS first normalized the single-cell count data before aggregating it based on the suggestion of a recent benchmarking paper [33], while ZJ-TWAS aggregated the single-cell count data first and applied normalization on the pseudo-bulk count (Supplementary Figure S1, see Section 4.2). We simulated GReX with parameters estimated on real data, and the observed single-cell counts based on an expression-measurement model [25, 26, 31] (see Section 4.4), which captured the sparse and heteroskedastic nature of scRNA-seq data, allowing for a realistic evaluation of method performance. We selected six cell types (CD4_NC_, CD8_ET_, B_IN_, Mono_C_, NK_R_ and Plasma), each representing different levels of cell type abundance in real data, with a mean of 472, 209, 84, 40, 10, and 5 cells per individual, respectively. See Section 4.3.1 for the definition of cell types.

Our results demonstrate that scTWAS consistently outperformed AG-TWAS and ZJ-TWAS in both prediction accuracy and power across cell types with varying abundance. We evaluate Stage 1 GReX prediction models by the prediction p-value from regressing observed expression on the estimated GReX in a fivefold cross-validation (see Section 4.5.1). Figure 1a shows that scTWAS yielded a greater number of significant Stage 1 GReX prediction models compared to ZJ-TWAS and AG-TWAS (nominal p-value < 0.05). In Figure 1b, scTWAS also exhibited higher power to detect gene-trait associations than the other two approaches, with the extent of the improvement varying by cell type abundance. Notably, the gain in power was more pronounced in less abundant cell types such as B_IN_ and Mono_C_ (0.23 and 0.30 higher than ZJ-TWAS, respectively). This highlights scTWAS’s ability to effectively leverage sparse data and enhance power in challenging scenarios, which are common among immune cell types in the OneK1K dataset, where 9 out of 13 have abundances less than or equal to B_IN_ (Supplementary Table S2). Even in highly sparse cell types such as NK_R_ and Plasma, where all methods exhibited low power, scTWAS still maintained an advantage (around 0.03 and 0.05 higher than ZJ-TWAS and AG-TWAS respectively). These results suggest that scTWAS is more powerful in uncovering gene-disease association across cellular contexts of varying abundance, with particularly strong gains in disease-relevant but less abundant cell types and subtypes. This includes microglia (subtypes) for AD as an example, which will be further demonstrated in Section 2.6.

We also evaluated the methods’ type-I errors in two scenarios: one where gene expression was independent of the cis-SNPs (i.e., a null Stage 1), and another where the phenotype was independent of the gene expression (i.e., a null Stage 2). All methods, including scTWAS, successfully controlled the type-I error rates across all cell types (Supplementary Table S1). Under a null Stage 1, all methods yielded out-of-sample $[eqn]$ close to 0 as expected (Supplementary Figure S2).

scTWAS enhances GReX prediction across cell types in blood and brain tissues

2.3

To benchmark GReX prediction models on real data, we applied scTWAS, AG-TWAS, and ZJ-TWAS to build cell-type-specific models using two population-scale single-cell datasets on PBMC [17] and brain tissues [18], and evaluated their performance based on both within-study and cross-study prediction accuracy (Methods). In the first application, we used OneK1K genotype and scRNA-seq data to build prediction models for 13 immune cell types. The average number of cells per individual varied widely across cell types, from 472 in CD4_NC_ to 5 in Plasma, with a median of 49 (Supplementary Table S2). We built prediction models for genes whose total counts in the cell type are greater than 1*,*000.

We assessed the accuracy of scTWAS, AG-TWAS and ZJ-TWAS GReX prediction models using the prediction p-value in a fivefold cross-validation (see Section 4.5.1). As shown in Figure 2a, scTWAS identified more significantly predicted genes across cell types, followed by ZJ-TWAS and AG-TWAS (Figure 2b, Bonferroni corrected p-values < 0.05). To further demonstrate the cross-study prediction performance of scTWAS, we applied the prediction models trained on OneK1K to predict cell-type-specific bulk gene expression data from the DICE (database of immune cell expression, expression quantitative trait loci and epigenomics) project [16] for six immune cell types that match between two independent studies (Methods). Figure 2b shows that scTWAS generated a median of 16% and 32% more significantly predicted genes compared to ZJ-TWAS and AG-TWAS, respectively (raw p-values < 0.05 for Pearson correlation, Supplementary Figure S3).

In addition to immune cell types, we further benchmarked scTWAS on GReX prediction models for brain cell types. We leveraged data from the Religious Orders Study and Memory and Aging Project (ROSMAP) study, which collected multi-omics data on postmortem brain samples, including paired whole genome sequencing data [34] and single-nucleus RNA-seq (snRNA-seq) data on dorsolateral prefrontal cortex across 424 individuals [18]. The nuclei have been annotated at two nested levels: major brain cell types and their subtypes [18, 21]. Accordingly, we built cell-type-specific prediction models for 6 brain cell types and 75 subtypes (Methods). The abundances of major brain cell types range from 1,403 cells per individual in excitatory neuron to 135 cells per individual in oligodendrocyte precursor cells (OPC) with a median of 525 cells per individual. For subtypes, the median abundance is 29 cells per individual (SD = 58) (Supplementary Table S2).

Figure 3a compares the prediction p-values in a fivefold cross-validation for microglia, demonstrating that scTWAS outperformed the other two methods across genes in generating more significant GReX models. To further validate prediction performance across studies, we applied the trained models to an independently collected snRNA-seq dataset on frontal cortex with 2.3 million nuclei from 427 individuals from the ROSMAP study, where 396 individuals have paired genotype data available [35]. As shown in Figure 3b, scTWAS again achieved more significant prediction p-values compared to the existing methods across the six major cell types. Furthermore, at the subtype level (Figure 3c, Supplementary Figure S7), scTWAS identified markedly more predictable genes than the other methods across the 75 subtypes, with the largest gains observed in microglia subtypes, showing up to a 30-fold increase. This highlights the advantages of scTWAS in predicting GReX for cellular contexts with sparser data and fewer cells. Taken together, these results on OneK1K and ROSMAP studies (Figures 2 and 3) demonstrate that scTWAS consistently generates better GReX prediction across tissue types and across datasets, which lays the foundation for increased power in testing gene-trait associations.

scTWAS discovers more associations with blood cell traits in immune cell types

2.4

Building on the improved Stage 1 performance of scTWAS on real data, we next benchmark its power to detect gene–trait associations in Stage 2 for hematological traits. Using the prediction models for immune cell types from Section 2.3, we performed cell-type-level TWAS of 29 hematological traits with GWAS summary statistics (Methods). Figure 4a compares the total number of significant gene-trait associations identified by three methods across cell types (also see Supplementary Figure S4 for a breakdown by individual GWAS). Consistent with the Stage 1 results, scTWAS identified the most associations across cell types, followed by ZJ-TWAS and AG-TWAS, and more significant gene-trait associations were identified in more abundant cell types. We further classified the significant genes into loci (Methods). If a locus harbors a genome-wide significant GWAS variant, it is considered known; otherwise, it is considered novel. scTWAS increased the total number of loci by an average of 17% and 45% across cell types compared to ZJ-TWAS and AG-TWAS, respectively, including 18% and 56% more novel loci (Figure 4b). The most substantial increases were observed in monocytes (both classical and non-classical), with the relative increment in the number of loci being up to 41% compared to ZJ-TWAS and 95% compared to AG-TWAS.

Additionally, we compared our cell-type-level TWAS results with bulk TWAS on the same traits [36], which used bulk RNA-seq data from whole blood on 922 European individuals to train the GReX models. As shown in Figure 4c, cell-type-level TWAS uniquely identified novel loci that were missed by tissue-level TWAS, with scTWAS identifying the greatest number of novel loci overall. For example, HHEX is a member of the homeobox gene family, which plays an important role in embryogenesis and hematopoietic progenitor development [37]. While both tissue-level and cell-type-level TWAS identified its association with mean platelet volume, eosinophil count, eosinophil percentage and lymphocyte percentage, the cell-type-level TWAS revealed that these associations were specific to CD8_ET_ and NK cells. This cell-type specificity underscores the unique value of cell-type-level TWAS in pinpointing the relevant cellular contexts where gene expression influences blood cell traits, providing more detailed and biologically meaningful insights than tissue-level analyses. Furthermore, scTWAS uniquely identified an association between HHEX expression and monocyte percentage in CD8_ET_ cells. These findings suggested a potential broader role for HHEX in hematopoiesis and immune cell regulation, consistent with its previously reported functions [38, 39].

Another example is the DGCR6 locus, where scTWAS uniquely identified its association with mean platelet volume and platelet count in Mono_C_, an association that was not discovered with ZJ-TWAS, AG-TWAS or bulk level TWAS. scTWAS revealed that the GReX of DGCR6 was positively associated with platelet count and negatively associated with mean platelet volume, corroborating recent findings that increased DGCR6 expression was implicated in reducing mean platelet volume [40]. Furthermore, DGCR6, located in the 22q11.2 region, is a candidate for involvement in DiGeorge syndrome (22q11.2 deletion syndrome) pathogenesis [41]. Individuals with 22q11.2 deletion syndrome have been shown to exhibit lower platelet counts and increased mean platelet volume [42]. Our cell-type-specific findings may offer new insights into the regulatory roles of DGCR6, particularly in how its expression in Mono_C_ might influence platelet dynamics, potentially informing the mechanisms underlying 22q11.2 deletion syndrome.

scTWAS reveals mechanistic insights into immune-mediated diseases in immune cell types

2.5

Rheumatoid arthritis (RA), systemic lupus erythematosus (SLE), and asthma are complex immune-mediated diseases that collectively affect millions of individuals worldwide and pose a significant public health burden [43–45]. Genome-wide association studies have identified hundreds of risk loci associated with these conditions [46–48], underscoring their highly polygenic architectures, but the functional interpretation of these loci remains limited due to the complexity of regulatory mechanisms and cellular heterogeneity in disease-relevant tissues. Immune cell types play a central role in the etiology of these diseases, as T cells, B cells, plasma cells, and monocytes have all been implicated through functional and transcriptomic studies, highlighting the importance of cell-type-specific analysis in understanding disease mechanisms [23, 24, 49].

Here, we performed cell-type-level TWAS on these three immune-related diseases using cell-type-specific GReX models trained on OneK1K immune cell data from Section 2.3. Transcriptome-wide significance was determined using the Bonferroni-corrected p-value threshold to account for the total number of genes tested in each cell type.

In our analyses, scTWAS identified 64, 41 and 111 genes associated with RA, SLE and asthma, respectively, in at least one of the 13 immune cell types. This corresponded to 11, 15 and 34 more genes than those identified by AG-TWAS, and 6, 6, and 16 more than those identified by ZJ-TWAS. Moreover, scTWAS uniquely identified 8 genes for RA, 11 for SLE, and 29 for asthma that were not identified by either AG-TWAS or ZJ-TWAS. We highlight a few examples here while a complete list of significant genes is provided in Supplementary Tables S3-S5. scTWAS uniquely identified an association between IRF5 and RA and SLE in Mono_C_ and NK cells, uncovering its cell-type-specific roles in disease mechanisms. IRF5 is a well-documented transcription factor critical in the regulation of immune responses, particularly in the production of type I interferons, and has been proposed as a therapeutic target for autoimmune diseases [50]. Previous studies have shown that IRF5 mediates joint inflammation by integrating Toll-like receptor signals and regulating proinflammatory cytokine and chemokine production in a mouse model [51, 52], and its inhibition suppresses SLE onset and progression in mice [53, 54]. Tissue-based TWAS also identified its association with RA and SLE [12, 55]. Our results corroborate and extend prior evidence by identifying specific immune cell types, Mono_C_ and NK cells, through which IRF5 may contribute to SLE pathogenesis. These results support further investigation of IRF5, particularly its inhibition in Mono_C_ and NK cells, as a promising direction for SLE drug discovery. scTWAS also uniquely identified SULF2 as significantly associated with asthma in plasma cells. Known for its role in modifying heparan sulfate proteoglycans, SULF2 can impact key biological processes such as cell signaling, inflammation, and tissue remodeling, including airway remodeling, a hallmark of chronic asthma [56]. Notably, while genetic variants in SULF2 gene have been associated with several diseases, there is currently no reported association with asthma or other lung diseases. Our findings provide evidence linking SULF2 to asthma pathogenesis, specifically through its potential role in plasma cells, and highlight it as a candidate for future functional studies and therapeutic exploration.

Next, we investigated the similarity and specificity of scTWAS findings across immune cell types. 21 out of 64 RA genes identified by scTWAS were specific to a single cell type, while the majority were shared across two or more cell types (also see Supplementary Figure S6 for a detailed UpSet plot). Several possibilities could lead to the cell-type specificity observed in scTWAS results. First, a gene may be expressed exclusively in one cell type. For example, ZFP57 was highly expressed (total counts > 1*,*000) only in CD4_NC_ cells and was associated with RA in that cell type. Second, a gene may be expressed in multiple cell types, but a predictive GReX model could be obtained only in a specific cell type, potentially due to cell-type-specific genetic regulation. For example, FLOT1 was expressed in all 13 immune cell types, but only the scTWAS GReX model in Mono_NC_ was predictive, and significant associations with RA, SLE and asthma were observed only in this cell type. Figure 4d shows the estimated genotypes’ effects by scTWAS for FLOT1 in Mono_NC_ with a prediction p-value of 4.3×10^−12^. In contrast, the prediction p-value is only 0.36 in CD4_NC_ (Figure 4e), even though it is the most abundant cell type, where stronger predictive performance would be expected if genetic regulation were present. This cell-type-specific signal in Mono_NC_ is further corroborated by the independent DICE study, where the expression quantitative trait loci (eQTLs) associated with FLOT1 were found to be more significant in non-classical monocytes [16]. Third, the association between GReX and disease may be inherently cell-type-specific. While PSMB9 was significantly predictable in several cell types, its association with RA was observed only in CD8_ET_ cells. For example, although PSMB9 was more significantly predicted in Mono_C_ with a p-value 6.9×10^−14^ (Figure 4g), as compared to 3.3×10^−9^ in CD8_ET_ (Figure 4f), there was no corresponding signal in the RA GWAS, resulting in a non-significant TWAS association in Mono_C_. Previous research has reported its cell-type-specific involvement in idiopathic inflammatory myopathies in peripheral blood [57], and further investigation into its function in RA is warranted. Lastly, differences in statistical power due to cell type abundance or sample sizes can also lead to apparent cell-type specificity in detecting gene-disease associations. A similar trend of cell-type similarity and specificity was also observed in SLE and asthma (Supplementary Figure S6).

To further explore potential cell-type-specific biological pathways revealed by our cell-type-level TWAS, we performed pathway enrichment analysis with g:Profiler [58] on significant genes for each disease to test for enrichment of gene ontology (GO) terms. We uncovered several interesting cell-type-specific pathways. In RA, we identified GO terms related to the production of molecular mediator of immune response exclusively in plasma cells. In an earlier study, plasma cells have been found to express receptor activator of NF- $[eqn]$ B ligand (RANKL), a key mediator in osteoclastogenesis [59]. This expression promotes bone resorption and contributes to periarticular bone loss, highlighting the role of plasma cells in both immune response and bone destruction in RA. In SLE, we found GO terms related to extracellular vesicles (EVs) and exosomes in Mono_NC_, which play key roles in intercellular communication, carrying inflammatory signals and autoantigens, and thus a crucial physiological and pathological role in SLE [60]. A recent study observed that the Mono_NC_ cells were prominently affected by the exosome, suggesting potential major cell types responsible for exosome treatment. In asthma, GO terms related to IgE binding and IgE receptor activity were identified in Mono_C_. IgE is a key mediator in allergic responses, and these findings suggest that Mono_C_ may play an important role in mediating the allergic inflammatory responses through IgE binding [61, 62].

Cell-subtype-level scTWAS analysis identifies AD-gene associations in microglia subtypes

2.6

Alzheimer’s disease (AD) is estimated to affect over 6 million elderly individuals in the United States, and this number is projected to double by 2060 [63]. Genetics play an important role in the pathogenesis of AD, with a high heritability of 60%-80% [64]. This motivates large-scale GWAS, which have identified over 70 independent risk loci significantly associated with AD [65]. However, the functional consequences of these AD associated variants remain to be characterized, especially in the cellular contexts relevant to the pathophysiological processes of AD. Recent advances in highly multiplexed snRNA-seq have facilitated the characterization and identification of these AD-associated cell subtypes in the brain [21, 66], such as subpopulations of microglia [21, 67–69], astrocytes [21, 70, 71], and oligodendrocytes [21, 71, 72]. Gene expression have been found heterogeneous across cell subtypes, and evidence from paired genotype and snRNA-seq data further suggests unique patterns of genetic regulation across cell subtypes [18].

Here, we leveraged scTWAS to investigate the cellular and functional mechanisms of AD-associated variants in brain cell subtypes. Using GReX models trained on ROSMAP dorsolateral prefrontal cortex data (Section 2.3), scTWAS identified a total of 78 genes associated with AD risk across all cell subtypes, with 33, 18, 20, 21, 22, and 5 genes in subtypes of excitatory neuron, oligodendrocyte, inhibitory neuron, astrocyte, microglia, OPC, respectively (Supplementary Table S6). Figure 5 shows that scTWAS reveals the cellular contexts through which AD-associated genes may contribute to disease risk. First, 19 genes were found across multiple cell types (Supplementary Table S6). Notably, ARL17B and KANSL1 were identified in all major cell types, suggesting that the mechanisms involving these genes may operate broadly across cellular contexts. For example, KANSL1 was previously suggested to be involved in transcription regulation, a fundamental process likely relevant across diverse cell types [73]. In contrast, many scTWAS signals were cell-type-specific. For example, CR1, a gene from an established AD locus [74], showed strong associations with AD exclusively in oligodendrocyte subtypes (Figure 5a, Supplementary Table S6). Prior studies suggest that CR1 may contribute to the clearance of $[eqn]$ -amyloid ( $[eqn]$ ) peptides and modulate immune-related pathways [74, 75], with additional evidence for its function in neuronal contexts [76]. These scTWAS findings suggest that CR1 may play a cell-type-specific role in AD pathogenesis through its activity in oligodendrocytes, motivating additional studies of its regulatory mechanisms in this context. Pathway enrichment analysis further reveals the enrichment of immune-related biological processes across cell types, the involvement of lipid pathways in astrocyte (regulation of lipid metabolic process) and microglia (protein-lipid complex assembly), and pathways identified in only one cell type such as regulation of catalytic activity and negative regulation of receptor-mediated endocytosis in microglia.

We next focused on microglia, the brain cell type most enriched for AD genetic risks [19, 77], to investigate the potential heterogeneity of AD pathogenesis within this cell population. The ROSMAP dataset annotates 16 subtypes of microglia with distinct biological states, including Mic.13 which is associated with lipid and enriched for signatures of disease-associated microglia, and Mic.15 which is enriched for inflammation-related signatures in human [21]. First, several genes were identified as significantly associated with AD across multiple microglia subtypes (Figure 5b). These include genes located in established AD GWAS loci, whose functional roles in immune pathways or in microglia are supported by prior literature (e.g., PICALM [78], HLA-DRB1 [79], EPHA1-AS1 [77], PTK2B [80], BIN1[81]). In addition, these results included genes such as DLEU1, RASGEF1C that have not been previously reported in AD GWAS loci [18], warranting future functional studies to investigate their potential roles in AD pathogenesis within microglia.

Next, we highlight two examples that demonstrate subtype-specific signals captured by scTWAS. First, scTWAS identified MS4A6A as significantly associated with AD specifically in Mic.13, whereas its Stage 1 prediction and Stage 2 association tests were generally insignificant across other microglial subtypes (Figure 5b). MS4A6A has been previously implicated in AD TWAS [18, 82] and is known to play an important role in microglia [83, 84]. Notably, despite the relatively low abundance of Mic.13, the association signal was strong (p-value = 1.4×10^−10^), suggesting that AD-associated variants may exert strong and subtype-specific regulatory effects on MS4A6A in disease-associated microglia, which were not observed in surveilling or reacting microglia.

As a second example, scTWAS identified PPP1R37 as significantly associated with AD exclusively in Mic.15. Located near the APOE locus, the function of PPP1R37 in brain cell types has not been well characterized. Interestingly, a previous study in a Japanese cohort reported that a PPP1R37 SNP remained nominally significant for AD risk even after adjusting for the number of APOE $[eqn]$ 4 alleles [85], suggesting a potentially independent role of this gene in disease susceptibility. A hippocampal TWAS also identified an association between PPP1R37 and AD [86]. Despite the low abundance of Mic.15, the association signal of PPP1R37 was highly significant (p-value = 2.9×10^−18^), suggesting that AD-associated variants may regulate PPP1R37 specifically in inflammatory microglial states. Further supporting its role in AD, scTWAS also identified a significant association between PPP1R37 and AD in Ast.5, a subtype of astrocytes associated with inflammatory response [21]. Notably, the TWAS effect sign differed across contexts—positive in Mic.15 and negative in Ast.5, suggesting that PPP1R37 may exert cell-state-specific regulatory effects in AD, involving both inflammatory microglia and astrocyte, and highlighting the need for further investigation. Finally, both of these associations, MS4A6A in Mic.13 and PPP1R37 in Mic.15 and Ast.5, were identified only by scTWAS and not detected by ZJ-TWAS or AG-TWAS, highlighting scTWAS’s ability to capture biologically meaningful and cell-state-specific signals in AD.

Discussion

3

In this study, we proposed scTWAS, a novel statistical framework for conducting transcriptome-wide association studies using single-cell data. The method is built on an expression-measurement model [25, 31], which quantifies the genetic regulation of underlying latent gene expression while accounting for noises and technical variations in observed gene counts. To estimate model parameters, we developed an efficient iteratively re-weighted least squares algorithm that accounts for the heteroskedasticity inherent in single-cell data. Compared with existing TWAS analyses that use single-cell data [28, 29], scTWAS avoids the need for heuristic data normalization and instead models the data following a more biologically motivated data-generating mechanism. This allows us to estimate genetically regulated expression more accurately, leading to improved power in the downstream association testing. Numerical studies showcased the improved performance of scTWAS compared to other approaches that rely on data normalization across various cell types, with particularly large gains in those with lower abundance.

We applied scTWAS to two population-scale single-cell datasets to train GReX models and perform gene-based association tests across various traits and diseases. Using data from the OneK1K study, we built cell-type-specific GReX models across 13 immune cell types and conducted TWAS for 29 hematological traits and three autoimmune diseases: rheumatoid arthritis, systemic lupus erythematosus, and asthma. We also applied scTWAS on data from the ROSMAP study to build GReX models across 6 major cell types and 75 cell subtypes in brain tissues, and conducted TWAS for Alzheimer’s disease. The cell-(sub)type-level TWAS analysis replicated disease-associated genes that were previously identified in tissue-level studies, while providing more granular insights into the cellular contexts in which these associations arise. For example, IRF5, a well-established gene associated with RA and SLE, was specifically identified in monocytes and natural killer cells. Importantly, because both the genetic regulation of gene expression and the effect of GReX on disease might differ across cell types, bulk tissue-level analyses may overlook these genes. scTWAS identified novel gene-trait associations that were not identified in previous tissue-level studies, highlighting the necessity of conducting cell-type-level TWAS. Moreover, we also identified cell-type-specific disease-associated genes that appeared uniquely in one specific cell (sub)type. For example, MS4A6A, a well-established AD gene, was identified only in Mic.13, a microglial subtype associated with lipid metabolism and enriched for disease-associated microglia signatures, highlighting its cell-type-specific role and therapeutic relevance. These results show that scTWAS successfully captured known signals and pointed to previously underappreciated cellular mechanisms in disease pathogenesis, which are potentially masked in bulk-level analyses.

While scTWAS provides a powerful statistical framework for cell-type-specific TWAS, several limitations should be noted. First, although the method can be conceptually viewed as a two-stage least squares approach, the resulting associations are not necessarily causal. Due to linkage disequilibrium, multiple genes in the same region may appear associated, and additional challenges such as horizontal pleiotropy [87–89] or shared regulatory variants across genes and cell types can complicate interpretation [90, 91]. Future extensions incorporating fine mapping and accounting for horizontal pleiotropy may help improve causal inference in this context. Second, in its current form, scTWAS trains a separate GReX prediction model for each cell type independently. While this design captures cell-type-specific genetic effects, it does not borrow strength across cell types, which could be valuable given the shared genetic regulation of gene expression in related cell populations [27]. A promising future direction is to extend the framework to jointly model multiple cell types, with the goal of improving prediction accuracy by sharing information across cell types while still preserving cell-type-specific signals. Third, scTWAS treats all SNPs equally when building the prediction model. However, incorporating functional annotations to prioritize SNPs with regulatory potential, such as chromatin accessibility or transcription factor binding, could further enhance prediction performance as well as model interpretability. This strategy has shown promise in related methods [68], and we leave this as an important direction for future work.

In summary, we introduce scTWAS, a powerful and biologically grounded statistical framework for performing TWAS using single-cell data. As population-scale scRNA-seq datasets become increasingly available, scTWAS provides a promising tool for building cell-type-specific GReX models and identifying cell-type-specific gene-trait associations for more diseases in disease-relevant cell types. We anticipate that scTWAS will facilitate novel biological insights, refine the mapping from genotype to phenotype, and ultimately contribute to a deeper understanding of disease mechanisms at the cellular level.

Methods

4

scTWAS method

4.1

scTWAS adopts a moment-based estimation framework to estimate $[eqn]$ for GReX prediction. Based on models Eq. (1) and Eq. (2), we observe the following moment conditions:

[eqn]

[eqn]

This motivates our scTWAS Stage 1 model:

[eqn]

where $[eqn]$ ’s are independent mean-zero random errors. The variance of $[eqn]$ is given by Eq. (4), which may vary greatly across samples due to technical variations in sequencing depths. This leads to heteroskedasticity and statistically inefficient estimates if left unaddressed. To mitigate this, we propose to estimate $[eqn]$ by minimizing the following weighted least squares loss function with an elastic net penalty on $[eqn]$ :

[eqn]

where we use $[eqn]$ as weights to downweight observations with higher variance and upweight those with lower variance, thereby enhancing statistical efficiency. We implement the optimization of Eq. (6) with R package glmnet [92], where the response is defined as $[eqn]$ , the design matrix consists of genotypes $[eqn]$ and covariates $[eqn]$ scaled by sequencing depth $[eqn]$ and $[eqn]$ , and no penalty is imposed on $[eqn]$ . We further use cross-validation to select the hyperparameter $[eqn]$ which minimizes the out-of-sample loss function.

As shown in Eqs. (3) and (4), the weights $[eqn]$ ’s used in Eq. (6) depend on unknown parameters $[eqn]$ (and $[eqn]$ ). To address this, we propose an IRLS algorithm that iterates between updating $[eqn]$ ’s and updating $[eqn]$ . In specific, we update $[eqn]$ given $[eqn]$ ’s by optimizing Eq. (6), and update $[eqn]$ ’s given $[eqn]$ by setting $[eqn]$ and leveraging Eqs. (1) and (4). Here $[eqn]$ denotes the over-dispersion parameter, encoding a mean-variance dependency as commonly observed in bulk and single-cell RNA-seq data [26, 32]. We estimate $[eqn]$ and the initial weights using R package sctransform [26] under a null model with no genetic effects. The IRLS algorithm is presented in Algorithm 1, which typically converges within ten iterations. See Supplementary Methods for more details on the algorithm.

For genes that are well-predicted (as evaluated by p-values of prediction in fivefold cross-validation [4, 5]), Stage 2 of scTWAS tests for the association between predicted gene expression and the trait of interest. When only the GWAS summary statistics are available, the association between predicted expression and the trait can be expressed as [5]:

[eqn]

where $[eqn]$ is the estimated SNP weights in Stage 1, $[eqn]$ is a vector of GWAS summary $[eqn]$ -scores of the corresponding cis-SNPs, and $[eqn]$ is the linkage disequilibrium (LD) correlation matrix of the cis-SNPs, which can be estimated by a reference panel. Under the null hypothesis of no association, $[eqn]$ follows a standard normal distribution and the p-value can be calculated analytically.

Other TWAS methods under comparison

4.2

In this work, we compare scTWAS with two existing approaches that use single-cell data for TWAS analysis, referred to as AG-TWAS [29] and ZJ-TWAS [28], respectively. Both of these approaches rely on the traditional TWAS framework [4, 5] developed for bulk gene expression data, which assumes the following Stage 1 model on normalized gene expression data $[eqn]$ :

[eqn]

where $[eqn]$ denotes random noise. Throughout, we used elastic penalty to estimate $[eqn]$ by minimizing the following loss function

[eqn]

while other penalty functions on $[eqn]$ can also be used. This framework has been widely applied to normalized and continuous bulk gene expression traits with relatively homoskedastic variance [4, 5].

To be compatible with the above traditional TWAS framework, AG-TWAS and ZJ-TWAS propose to apply various normalization strategies prior to applying the framework (Supplementary Figure S1). In specific, AG-TWAS begins by applying counts per million (CPM) normalization to the cell-by-gene count matrix for a specific cell type, followed by log transformation and mean aggregation across cells for each individual; ZJ-TWAS first aggregates single-cell counts by individual to create a pseudo-bulk count matrix, then applies TMM normalization (via the R package edgeR) and log transformation. Finally, both approaches perform inverse normal transformation on the aggregated expression values.

Data sets

4.3

Genotype and scRNA-seq data

4.3.1

The OneK1K study

The OneK1K study comprises scRNA-seq data from 1.27 million peripheral blood mononuclear cells (PMBCs) collected from 982 donors [17]. Cells have been classified into 14 immune cell types by the authors, including plasma cells (Plasma); dendritic cells (DC); two B cell types – immature and naïve B (B_IN_) cells and memory B (B_Mem_) cells; two natural killer cell types – natural killer (NK) cells and NK recruiting (NK_R_) cells; two monocyte cell types – classical (Mono_C_) and nonclassical (Mono_NC_) monocytes; three CD4^+^ T cell types – CD4 naïve and central memory T (CD4_NC_) cells, CD4^+^ T cells with an effector memory or central memory phenotype(CD4_ET_), and CD4^+^ T cells expressing SOX4 (CD4_SOX4_); and three CD8^+^ T cell types – CD8 naïve and central memory T cells (CD8_NC_), CD8^+^ T cells with expression of S100B (CD8_S100B_), CD8^+^ T cells with an effector memory phenotype (CD8_ET_) [17]. For each cell type, we constructed a pseudo-bulk count matrix by extracting unique molecular identifier (UMI) counts of cells from the cell type of interest and aggregating them for each individual. Genes with pseudo-bulk UMI counts less than 1,000 were filtered out. Finally, CD4_SOX4_ was excluded from the analysis due to the limited number of genes remained. The number of samples, number of cells, and number of genes retained in the analysis for each cell type is provided in Supplementary Table S2.

The genotype data were obtained from the authors, where 483,255 autosomal SNPs were genotyped and imputed to the Haplotype Reference Consortium panel, with only SNPs having an imputation $[eqn]$ being retained. We further performed quality control using PLINK 2.0 [93] to retain variants with minor allele frequency > 0.01 and in Hardy-Weinberg equilibrium (--maf 0.01 --hwe 1e-6).

The ROSMAP study

The ROSMAP study is designed to study aging and dementia with multi-omics profiling of human brain, including whole genome sequencing (WGS) data [34] and population-scale single-nucleus RNA-sequencing data on postmortem brain samples [18, 35]. We performed similar quality control of genotype data using PLINK 2.0. To train Stage 1 prediction models, we used the snRNA-seq data on dorsolateral prefrontal cortex from [18], which profiled 1.65 million nuclei and identified a total of 16 major cell types and 95 cell subtypes. Among 437 individuals with snRNA-seq data, 424 have both genotype and snRNA-seq data available. We excluded major cell types with less than 10*,*000 cells and focused on the remaining six major cell types (excitatory neurons, oligodendrocytes, inhibitory neurons, astrocytes, microglia, and OPCs) and their 75 subtypes. For example, there are 16 subtypes of microglia under various biological states in this dataset (Figure 5b). For each of these cell (sub)types, we constructed pseudo-bulk count matrices similar to the procedure above and filtered out genes with less than 1,000 total counts. The number of samples, number of cells, and number of genes retained in the analysis for each cell (sub)type are provided in the Supplementary Table S2. The only exception is the analysis of microglia subtypes in Figure 5, where we focused on the genes with at least 1,000 total counts in the microglia major cell type to expand the set of genes studied within the cellular context most relevant to AD mechanisms. For cross-study validation, we further used an independently generated snRNA-seq dataset on prefrontal cortex [35], that profiled 2.3 million nuclei and identified seven major cell types. Among 427 individuals with snRNA-seq data, 396 have both genotype and snRNA-seq data available. We focused on excitatory neurons, oligodendrocytes, inhibitory neurons, astrocytes, microglia, and OPCs to validate the GReX models trained on data from dorsolateral prefrontal cortex [18].

The DICE project

The DICE project collects genotype data and bulk RNA-sequencing data on FACS-sorted immune cells from 91 subjects [16]. We used the genotype data from the original publication which have been imputed with the 1000 Genomes Project phase 3 [94]. We used the bulk gene expression data from the original publication which have been normalized with Transcript Per Million (TPM) [95]. Among the 15 immune cell types profiled in DICE, we matched six of them to the most similar cell types in OneK1K, including classical Monocytes, non-classical Monocytes, CD16 NK cells, Naive B cells, Naive CD4 T cells (matched to CD4_NC_ in OneK1K) and Naive CD8 T cells (matched to CD8_NC_ in OneK1K).

GWAS summary data

4.3.2

GWAS summary statistics for hematological traits were generated by Neale Lab (https://www.nealelab.is/uk-biobank) using UK Biobank (UKB) White ancestry data ( $[eqn]$ ). GWAS summary statistics for the three immune-related diseases (RA, SLE and Asthma) were obtained from [46] ( $[eqn]$ ), [47] ( $[eqn]$ ), [48] ( $[eqn]$ ) respectively. These GWAS summary statistics were then imputed using ImpG [96] with OneK1K samples as the reference panel. GWAS summary statistics for Alzheimer’s disease were obtained from [65], which were imputed using ImpG with ROSMAP WGS data as the reference panel. We further conducted quality control on the imputed GWAS data, including filtering out imputed SNPs with imputation accuracy $[eqn]$ , and any SNPs with ambiguous alternative alleles.

Simulation settings

4.4

We performed simulation studies to evaluate scTWAS with genotype and gene expression data of 982 individuals from the OneK1K study. We simulated the raw count for the $[eqn]$ -th cell of the $[eqn]$ -th individual with a Poisson-Gamma model, and the trait of interest of the $[eqn]$ -th individual as follows:

[eqn]

[eqn]

[eqn]

In Eq. (10), $[eqn]$ were the standardized genotypes of the cis-SNPs around a gene, and 10 causal cis-SNPs were randomly selected with the corresponding weights generated from Uniform((−3×10^−6^, −5×10^−9^) ∪ (5×10^−9^, 3×10^−6^)). The intercept $[eqn]$ was set to 1×10^−4^, and $[eqn]$ was set to 10. In Eq. (11), the sequencing depth of the $[eqn]$ -th cell $[eqn]$ in a specific cell type was directly extracted from the OneK1K real data. A random intercept $[eqn]$ was simulated to introduce correlations among cells within the $[eqn]$ -th individual. In Eq. (12), the GWAS trait was simulated such that the gene expression accounted for 5% of its variance.

We also considered two scenarios to evaluate the type I error rate of the methods. In the first setting, we simulated the expected underlying gene expression level $[eqn]$ independent of the genotype data (cis-SNPs). In the second setting, we simulated the GWAS trait from a standard normal distribution independent of the gene expression.

We selected six cell types (CD4_NC_, CD8_ET_, B_IN_, Mono_C_, NK_R_ and Plasma) for the simulation, representing different levels of cell type abundance in the real data. 20 genes were randomly selected and 50 replicates were performed for each gene-cell type pair, with a total of 1,000 simulation replicates in a specific cell type. In each replicate, we only replaced the count of the specific gene with the simulated single-cell count, while keeping the counts of other genes the same as the OneK1K real data.

With the cell by gene count matrix, we applied the proposed scTWAS, and AG-TWAS and ZJ-TWAS described in Section 4.2. To evaluate the power of TWAS analysis, we defined a significant gene-trait association as one where the prediction p-value < 0.05 in fivefold cross-validation in Stage 1 (see Section 4.5.1), and the Stage 2 association was also significant with a p-value < 0.05*/ $[eqn]$ *, where $[eqn]$ corresponded to the number of tests in the real data application in each cell type. To evaluate the type-I error, we defined a significant gene-trait association as one where both the GReX prediction in cross-validation and the Stage 2 association were significant (nominal p-value < 0.05).

Real data applications

4.5

Prediction model training and evaluation

4.5.1

With the aforementioned genotype and scRNA-seq data in Section 4.3.1, we built Stage 1 prediction models for each gene-cell type pair. For each gene, cis-SNPs within 500 kB upstream of the gene transcription start site and 500 kB downstream of the transcription end site were used as predictors. Additionally, to ensure a high overlap between the SNPs in the prediction models and those in the GWAS summary statistics, we only considered SNPs present in the UKB GWAS summary statistics for the OneK1K models, and those present in the AD GWAS [65] for the ROSMAP models. For OneK1K models, we adjusted for covariates including sex, age and six genotype principal components provided by the authors. For ROSMAP models, we adjusted for covariates including the first three genotype PCs, age at death, sex, post-mortem interval, study (ROS or MAP), and a hidden batch effect covariate inferred by principal component analysis (Supplementary Methods, Supplementary Figures S8-S9). R package glmnet was used to train the elastic net model for all three methods with the mixing value $[eqn]$ and the tuning parameter $[eqn]$ was selected using fivefold cross-validation.

We evaluated the within-study prediction accuracy by fivefold cross-validation. For AG-TWAS and ZJ-TWAS, we regressed the observed normalized expression on the predicted expression across all predicted folds as implemented in FUSION-TWAS. For scTWAS, we used weighted linear regression to regress the observed pseudo-bulk counts on the predicted $[eqn]$ with estimated weights across all predicted folds. Only genes significantly predicted in Stage 1, defined as those whose estimated GReX is predictive of observed gene expression in cross-validation (Bonferroni corrected p-value < 0.05), were included in Stage 2 analysis to test for the association with the trait of interest.

We also evaluated the cross-study prediction performance of three methods. For the GReX models of immune cell types trained with OneK1K data, we used the external validation dataset from the DICE project [16]. We focused on 6 cell types that match between OneK1K and DICE, including classical monocytes, non-classical monocytes, NK cells, naive B cells, naive CD4^+^ T cells and naive CD8^+^ T cells. For each cell type and method considered, we used the prediction model trained with the corresponding cell type from OneK1K data and the DICE genotype data to predict cell-type-specific gene expression for DICE subjects. We calculated the Pearson correlation between predicted gene expression and log(TPM+1) normalized bulk gene expression data from DICE, and compared the number of significantly predicted genes (nominal p-value < 0.05, Pearson’s correlation) across three methods. For the GReX models of brain cell types trained with ROSMAP data on dorsolateral prefrontal cortex [18], we used an independently generated snRNA-seq dataset on prefrontal cortex from [35] for external validation. We focused on six cell types that overlap between two studies (excitatory neuron, oligodendrocyte, inhibitory neuron, astrocyte, microglia, and OPC) and used the prediction models trained with the corresponding cell type in [18] and the genotype data to predict cell-type-specific gene expression in [35]. The prediction performance of each method is evaluated by prediction p-values, as defined above, following the approach used in cross-validation.

TWAS analysis using GWAS summary statistics

4.5.2

We applied cell-type-level TWAS to GWAS summary data using FUSION-TWAS [5, 97]. FUSION-TWAS leverages GWAS summary statistics, a gene expression predictive model, and an LD reference panel to test the association between genetically predicted gene expression and a phenotype of interest, analyzing one gene at a time. Specifically, we conducted TWAS analysis on 29 hematological traits and 3 immune-related diseases using Stage 1 models trained on OneK1K immune cell types with scTWAS, AG-TWAS, and ZJ-TWAS, respectively, and the OneK1K study as the LD reference panel. We also conducted TWAS analysis on Alzheimer’s disease using Stage 1 models trained on ROSMAP brain cell (sub)types with scTWAS, AG-TWAS, and ZJ-TWAS, respectively, and the ROSMAP study as the LD reference panel. Finally, to account for the large number of hypotheses tested, we used Bonferroni corrected p-value threshold of $[eqn]$ to identify significant gene-trait associations in a specific cell type of interest, where $[eqn]$ was the number of genes significant in at least one prediction model.

To define a locus, we started by ranking significant genes according to their TWAS p-values, with higher ranks assigned to smaller p-values. For the top-ranked gene, we defined a locus as a 1 million base pair window around it and included all genes within this region as part of the locus. This process was repeated for the remaining significant genes until all significant genes were associated with loci.

GO enrichment analysis

4.6

In Section 2.5, we used gost in R package gprofiler2 (v0.2.2) to perform enrichment analysis for GO with one-sided Fisher exact tests and selected GO terms with adjusted p-value < 0.05 using the default ‘g_SCS’ method.

Supplementary Material

This is a list of supplementary files associated with this preprint. Click to download.

SupplementTable.xlsx

supp.pdf

Bibliography97

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Emilsson V. Genetics of gene expression and its effect on disease. Nature 452, 423–428 (2008).18344981 10.1038/nature 06758 · doi ↗ · pubmed ↗
2Nicolae D. L. Trait-associated snps are more likely to be eqtls: annotation to enhance discovery from gwas. P Lo S genetics 6, e 1000888 (2010).20369019 10.1371/journal.pgen.1000888 PMC 2848547 · doi ↗ · pubmed ↗
3Maurano M. T. Systematic localization of common disease-associated variation in regulatory dna. Science 337, 1190–1195 (2012).22955828 10.1126/science.1222794 PMC 3771521 · doi ↗ · pubmed ↗
4Gamazon E. R. A gene-based association method for mapping traits using reference transcriptome data. Nature genetics 47, 1091–1098 (2015).26258848 10.1038/ng.3367 PMC 4552594 · doi ↗ · pubmed ↗
5Gusev A. Integrative approaches for large-scale transcriptome-wide association studies. Nature genetics 48, 245–252 (2016).26854917 10.1038/ng.3506 PMC 4767558 · doi ↗ · pubmed ↗
6Mai J., Lu M., Gao Q., Zeng J. & Xiao J. Transcriptome-wide association studies: recent advances in methods, applications and available databases. Communications Biology 6, 899 (2023).37658226 10.1038/s 42003-023-05279-y PMC 10474133 · doi ↗ · pubmed ↗
7Sun Y. A transcriptome-wide association study of alzheimer’s disease using prediction models of relevant tissues identifies novel candidate susceptibility genes. Genome medicine 13, 1–11 (2021).33397400 10.1186/s 13073-020-00808-4PMC 7784364 · doi ↗ · pubmed ↗
8Guo S. & Yang J. Bayesian genome-wide twas with reference transcriptomic data of brain and blood tissues identified 141 risk genes for alzheimer’s disease dementia. Alzheimer’s Research & Therapy 16, 120 (2024).