CellPolaris: Transfer Learning for Gene Regulatory Network Construction to Guide Cell State Transitions
Guihai Feng, Xin Qin, Jiahao Zhang, Wuliang Huang, Yiyang Zhang, Wentao Cui, Yao Chen, Shirui Li, Wenhao Liu, Yao Tian, Yana Liu, Jingxi Dong, Ping Xu, Zhenpeng Man, Guole Liu, Zhongming Liang, Xinlong Jiang, Xiaodong Yang, Pengfei Wang, Ge Yang, Hongmei Wang, Xuezhi Wang

TL;DR
CellPolaris is a new tool that uses transfer learning to build gene regulatory networks and identify key transcription factors that guide cell fate transitions.
Contribution
CellPolaris introduces a transfer learning framework for GRN construction and TF perturbation simulation, enabling cell-type-specific regulatory insights.
Findings
CellPolaris accurately constructs gene regulatory networks using transcriptomic data and transfer learning.
The framework identifies master transcription factors critical for cell fate transitions with high overlap to experimentally validated regulators.
CellPolaris simulates developmental consequences of TF perturbations, such as Rfx2 knockout during spermatid differentiation.
Abstract
Cell fate decisions are orchestrated by intricate gene regulatory networks (GRNs), which govern gene expression with precise spatiotemporal control. However, accurately capturing context‐specific nature of gene regulation remains challenging, particularly when integrating multi‐omics data at bulk and single‐cell level across diverse cellular contexts. Here, we present CellPolaris, a unified computational framework designed to decode the roles of transcription factors (TFs) in developmental processes. CellPolaris performs TF‐centered GRN construction, master TF identification, and TF perturbation simulation. By leveraging transfer learning, the framework generates tissue‐specific or cell‐type‐specific GRNs using pre‐constructed high‐confidence GRNs of diverse contexts and requires only transcriptomic data as input. Using these learned GRNs, CellPolaris identifies underlying master TFs…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
FIGURE 1
FIGURE 2
FIGURE 3
FIGURE 4
FIGURE 5
FIGURE 6- —Strategic Priority Research Program of the Chinese Academy of Sciences
- —National Key Research and Development Program of China10.13039/501100012166
- —CAS Project for Young Scientists in Basic Research
- —National Natural Science Foundation of China10.13039/501100001809
- —Informatization Plan of Chinese Academy of Sciences
- —China Postdoctoral Science Foundation10.13039/501100002858
- —Science and Technology Commission of Shanghai Municipality10.13039/501100003399
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSingle-cell and spatial transcriptomics · Pluripotent Stem Cells Research · Gene Regulatory Network Analysis
Introduction
1
Lifespan development is orchestrated by gene regulatory networks (GRNs) that integrate environmental and cellular signals to guide biological processes [1, 2, 3, 4]. Transcription factors (TFs) are the core nodes of these GRNs. They bind to cis‐regulatory elements of target genes (TGs) in a spatiotemporal manner to precisely regulate cell fate decisions, including cellular differentiation and reprogramming [5, 6]. Particularly, some TFs that play a determinant role in establishing and maintaining the cell are referred to as master TFs [7]. The regulatory mechanisms of these master TFs and their corresponding GRNs are highly complex and dynamic, both during normal development and in disease pathogenesis. Therefore, deciphering TF‐centered GRNs, identifying master TFs governing cell identity, and predicting the effects of TF perturbations on cell states are the critical yet challenging goals in life sciences research.
For GRN construction, Chromatin Immunoprecipitation followed by Sequencing (ChIP‐Seq) is a reliable method for analyzing TF‐TG regulatory relationship of interested TFs. However, this method cannot recapitulate a comprehensive GRN, since it can only provide TF‐TG information for one pre‐selected TF per experiment. Using ChIP‐Seq to map network involving dozens or hundreds of TFs would be a labor‐intensive work requiring stringent experimental conditions. To address these limitations, computational methods have been developed to infer GRNs from omics data, such as GRNBoost2 [8], SCENIC [9], ICAnet [10], DeepGRNCS [11], GNNLink [12], and SimiC [13, 14, 15]. Co‐expression‐based approaches infer TF–TG regulatory relationships by capturing covariation in transcriptomic data. Additionally, integrating epigenomic data, such as ATAC‐Seq, has enhanced GRN inference by providing insights into TF binding sites, as demonstrated by tools like PECA2 [16], scREG [17], SCENIC+ [18], and DeepTFni [19]. For identifying master TFs, a common strategy is to compare GRNs between discrete cell states, such as the initial and target states, to pinpoint key regulators, as implemented in tools like Mogrify [20], CellNet [21], and ANANSE [22]. Alternative method, like CEFCON, leverages the continuous dynamics of cell fate transitions to identify critical TFs [23]. In terms of simulating TF perturbations, CellOracle [24] integrates ATAC‐Seq‐derived GRNs with single‐cell transcriptomic data to model the effects of TF perturbations on TG expression and cell fate. Despite these advancements, there is still no unified framework that enables users to efficiently perform tasks such as constructing cell state‐specific GRNs, identifying master TFs, and predicting perturbation effects, all while requiring minimal input, such as transcriptomic data alone. Developing such an integrated framework would greatly facilitate research in this field.
In this study, we present CellPolaris, a unified framework to elucidate the roles of TFs in developmental processes, including TF‐centered GRN construction, master TF identification, and TF perturbation simulation. A key improvement of CellPolaris is the development of a transfer learning model to infer context specific GRNs by learning knowledge from large scale GRNs and transferring it to the context of interest with RNA‐Seq data. The model was trained using high‐confidence GRNs in diverse contexts generated from transcriptomic data and paired ATAC‐Seq data by PECA2 [16]. We utilized the critical domain training (CDT) method to mitigate shifts in TF to TG regulations across diverse sources, thereby enhancing the transfer of knowledge from known GRNs to new contexts. Benchmarking tests confirmed the reliability of CellPolaris in GRN construction. In addition, incorporating cross‐species GRNs, such as those from mice, yielded stable improvements in predicting human‐cell GRNs.
Based on the tissue‐specific or cell‐type specific GRNs, we design two downstream tasks, including predicting the master TFs governing cell fate transition and simulating the impacts of TF perturbations in developmental processes. In different reprograming scenarios, the top‐ranked TFs predicted by CellPolaris largely overlapped with the combinations of factors that successfully achieved reprogramming. For the task of perturbation simulation, instead of relying on predefined GRNs as used in previous software [24], we generate cell‐type‐specific GRNs and perform in silico TF knockouts using a Probabilistic Graphical Model (PGM) to predict changes in the expression of TGs. We validated the model through actual Rfx2 knockout experiments and demonstrated that CellPolaris exhibits partial advantages over the state‐of‐the‐art tool with similar function.
Results
2
Design and Organization of CellPolaris
2.1
To develop a unified framework for constructing GRNs with TFs as the cores, identifying master TFs controlling cell fate transitions, and simulating the effects of TF perturbations on cell status, we proposed the CellPolaris model. This model consisted of two core modules. A transfer learning model module that generates tissue‐specific or cell‐type‐specific GRNs, and a GRN‐dependent downstream task module to predict master TFs critical for cell identity and to simulate TF perturbation in developmental processes (Figure 1).
Schematic overview of CellPolaris design. A) Transfer learning model learns from a GRN database constructed by PECA2 tool from ATAC‐Seq and corresponding cell state RNA‐Seq data and transfers to tissue/cell type/state context of interest with RNA‐Seq data. This model enables cross‐species and cross‐tissue analysis. The trained model can take RNA‐Seq data as input and generate corresponding GRNs, which can be utilized for downstream applications. B) Prediction of cell fate regulatory factors based on GRN analysis: Comparison of GRN differences between source and target cells during cell fate transitions. Transcription factor in the differential networks were scored and ranked. C) Simulation of the impact of gene perturbation on cell differentiation pathways based on GRN analysis. A probability graphical model of gene regulation was constructed using GRN analysis. Simulated knockout genes were set to zero, and the expression changes of neighboring node genes were inferred to deduce changes in cell states.
GRN construction, however, faces two major challenges: the high cost of acquiring multi‐modal data beyond transcriptomics and the difficulty of leveraging existing knowledge from different tissues or time periods to construct GRNs for new contexts. To solve these issues, we constructed a high‐confidence GRN database centered on TFs using paired RNA‐Seq and ATAC‐Seq data. We then built a transfer learning model by integrating prior GRN knowledge from diverse cell states, allowing this model to infer the GRNs across different tissues and states using only bulk or single‐cell RNA‐Seq expression matrices as input (Figure 1A). Next, we used these inferred GRNs to perform two different downstream tasks: 1) Predicting the master TFs involved in cell fate transitions by extracting differential GRNs between cell states (Figure 1B). 2) Simulating in silico GRN‐based TF perturbations by building a PGM with regulatory weights (Figure 1C).
Constructing GRNs Through Transfer Learning
2.2
CellPolaris generalizes GRNs across different tissues and cellular states (Figure 2A; Figure S1, Supporting Information). As the basis of model training, high‐confidence GRNs were first produced using the PECA2 [16, 17], which employs paired RNA‐Seq and ATAC‐Seq data to infer GRNs (Figure S2, Supporting Information). Based on this multi‐modal dataset, we constructed GRNs for 88 mouse and 68 human bulk data sources from various tissues and developmental periods, as well as cell‐type‐specific GRNs for 40 mouse and 14 human single‐cell sources (Table S1, Supporting Information). Each source in the GRN database was considered a domain.
Critical domain training method. A) Overview of CDT: Using the RNA‐Seq data as input to generate GRN for each domain by CDT and output cell type specific GRN. B) Workflow of the training and inference of CDT method. Training Phase: Obtain a high confidence GRN resource using PECA2. And then identifying critical domain pairs dynamically by calculating domain similarity. Finally, optimize the GRN generator via critical domain training. Inference Phase: Apply the trained model to predict cell type specific GRNs using RNA‐Seq data from the target domain. C) Calculate domain similarity with Maximum Mean Discrepancy to dynamically choose the top‐σ ratio critical domain pairs and integrate regulations to generate generalized samples to emphasize optimizing them. D) Generating generalized regulations between critical domains. The left side displays the original regulations, whereas the right side showcases a combination of mixed samples and the original regulations. E) Technical illustration of regulation integration by using Mixup between gene regulations from two domains. F) The distribution before (large domain discrepancy exists between domains) and after training with CDT (domain discrepancy is reduced).
Leveraging these pre‐existing GRNs, we constructed a GRN generator (Figure 2A and Experimental Section) that infers the gene regulatory scores for all TF‐TG pairs. The gene expression level and TF‐TG regulatory relationships exhibit differences between different domains; this will result in feature distribution gaps between domain pairs. Next, we introduced a transfer learning model that can generalize the GRN generator to new unseen transcriptome data using a critical domain training (CDT) strategy (Figure 2B). The core concept of this strategy is to dynamically bridge the expression relationships between the least similar domains. The large distribution gaps between these domain pairs make acquiring invariant knowledge across diverse domains challenging. Consequently, we dynamically emphasized the optimization of these critical domains. By minimizing the distribution gap, our goal is to reduce overall domain discrepancies and consequently improve model generalization across diverse data distributions. We calculated the distance between any two domains in the feature space and selected the top percentage of the domains with the furthest distances as critical domains. These domains have large domain shifts that impede the learning of domain‐invariant knowledge, crucial for generalization (Figure 2C and Experimental Section). After identifying the critical domains, we used a Mixup strategy [25] for data augmentation to reduce the distance between critical domains (Figure 2D). Linear interpolation was performed on sample pairs from two different domains, thereby generating new samples that integrated knowledge from both domains and reduced the distribution gap (Figure 2E). After multiple rounds of the dynamic selection of critical domains, data augmentation, and generalization, the model finally converged (Figure 2F).
Performance of the GRN Transfer Model
2.3
To evaluate the performance of our generalized model, we trained it using GRNs in our database (Table S1, Supporting Information) and compared its performance with that of three popular domain generalization methods: Domain‐Adversarial Neural Networks (DANN), Mixup, and Ours‐MMD (maximum mean discrepancy with CDT) [25, 26, 27]. The DANN introduces an adversarial training objective to learn invariant knowledge across domains. Random Mixup involves making linear interpolations between random domain samples. Ours‐MMD adopts the widely used Maximum Mean Discrepancy (MMD) instead of mixing between critical domains. We employed the R^2^ score to assess the correlation between the TF‐TG regulatory strength values predicted by our model and those produced by PECA2. This evaluation helps determine the effectiveness of our regression fitting. Our model achieved a network correlation of approximately 95% (R^2^, see Experimental Section) when comparing the transferred GRNs with the high‐confidence GRNs available in the database. This comparison was conducted across different cell types using human and mouse single‐cell data (Figure 3A). Evaluation across various performance indicators, including the root mean square error (RMSE), mean absolute percentage error (MAPE), and area under the ROC curve (AUROC), demonstrates that our model performs slightly better than or comparably to the other three models (Figure 3A). We also compared an equal number of top‐ranked regulatory relationships predicted by our model with those generated by PECA2 for a specific single‐cell population. The comparison showed that approximately 67–77% of the regulatory relationships were consistent between the two methods (Figure 3B). Overall, these results demonstrated the effectiveness and generalizability of our approach for constructing tissue or cell‐type‐specific GRNs using transcriptome data alone.
Performance of CellPolaris. A) Cross‐cluster evaluation on human and mouse single‐cell RNA‐Seq data: Our model is trained using single‐cell‐derived GRN from PECA2 and compared with three state‐of‐the‐art domain generalization methods on both human and mouse single‐cell RNA‐Seq data. Each single‐cell cluster is treated as a distinct domain, leaving one cluster out as the target domain. The model is then trained on the remaining source tissues and directly tested on the target to evaluate its generalization performance. B) Comparison of GRN edges between PECA2 and model predictions: We conduct a thorough comparison between our model's predictions and PECA2's results for 4 human and mouse cells. The terms L2/3, IT (Intratelencephalic) neurons, and L4 are commonly used in neuroscience to classify specific layers and types of neurons in the mammalian brain; CD4 and CD8 are helper T cells and cytotoxic T cells, respectively. We compare the same number of top‐ranked regulatory relationships predicted by our model with the edges generated by PECA2. C‐D) Cross‐tissue evaluation: Similar to the cross‐cluster evaluation, we treat tissues as separate domains. The model is trained on several source tissues and evaluated on data from the leave‐one‐out tissue to assess its ability to generalize across tissues. E) Cross‐period evaluation: Similar to the cross‐tissue evaluation, we treat different developmental periods as separate domains. The model is trained on several source developmental periods and evaluated on data from the leave‐one‐out period to assess its ability to generalize across periods. F) Comparison results on the ChIP‐Atlas dataset: Performance (AUROC score) of CellPolaris compared to several other GRN inference methods, including CellOracle, GENIE3, SCENIC, DeepGRNCS and GRNBoost2, using the ChIP‐Atlas dataset.
Subsequently, we evaluated the accuracy of predicting GRNs across different tissues and periods using bulk RNA‐Seq data from mouse. We treated each tissue as a domain and used a leave‐one‐out setting. One tissue was held out as the target test domain, while the remaining non‐similar tissues used as the source domain for training. The model achieved a general performance of over 90% (AUROC) in predicting GRNs across different tissues (Figure 3C, D) and over 95% in predicting GRNs across different periods (Figure 3E).
In principle, our transfer model supports cross‐species transferring. To address cross‐species GRN transfer, we introduced an Extrapolative Mixup (extra_mixup) method based on the Mixup strategy (Experimental Section and Figure S3A, B, Supporting Information). Results showed that incorporating mouse heart‐derived GRNs markedly improved the ranking of heart‐specific TF‐TG regulatory relationships in the human heart GRNs constructed by CellPolaris, thereby identifying more target genes specifically regulated by heart‐expressed TFs (Figure S3C, D, Supporting Information). Moreover, we found that increasing the number of organs sourced from external species further enhances CellPolaris's overall performance without significantly increasing the proportion of false TF‐TG relationships (Experimental Section and Figure S3E–H, Supporting Information). In summary, our results indicate that CellPolaris is reliable for predicting tissue or cell‐type‐specific GRNs using transcriptomic data.
Finally, we evaluated the performance of our model‐generated GRNs by comparing them with those generated by existing GRN inference methods [9, 10, 11, 13, 16, 24, 28], including CellOracle, GENIE3, GRNBoost2, SCENIC, DeepGRNCS, SimiC, and ICAnet. We used RegNetwork, TRUUST, and ChIP‐Atlas databases [29, 30, 31] as the gold standard for TF‐TG regulatory pairs (Experimental Section). AUROC was chosen as the accuracy metric for the ChIP‐Atlas database, while the Recall Score was selected for both the RegNetwork and TRUUST databases (Experimental Section). Initially, we analyzed the target genes of 6 TFs across 4 cell types using the ChIP‐Atlas database and observed that CellPolaris achieves an average AUROC of 0.78 while other methods yield an average AUROC ranging from 0.54 to 0.59 (Figure 3F and Experimental Section). Furthermore, we evaluate a total of 16 TFs across 5 cell types to benchmark these methods using TRUUST and RegNetwork databases. The Recall Score also indicated that CellPolaris outperforms other methods (Figure S4, Supporting Information).
Prediction of Potential Master TFs During Cell Fate Transitions
2.4
The role of master TFs in cell fate transition is well established. To search for potential master TFs in reprogramming scenarios, we analyze the changes in GRN between source and target cell types rather than just considering the differential expression of TFs. We constructed tissue or cell‐type‐specific GRNs for source and target cells during reprogramming. Target cell‐type‐specific differential GRN was identified by filtering out the source GRN and our GRN database (Experimental Section). Next, we ranked the TFs from the differential GRN based on the following four indicators: their expression fold changes in the source and target cells, number of downstream target genes, weighted average fold changes of target genes, and eigenvector centrality of TFs, which represents the importance of target genes in the GRN structure and can be regarded as the score of the node when the information transfer throughout the graph reaches a steady state (Figure 4A). We validated our predictions using three previously reported reprogramming and transdifferentiation systems [20, 32]. Our results demonstrated that majority of combinations of reprogramming factors were included in the list of top‐ranked TFs (Figure 4B, C; Figure S5, Supporting Information). Most of the top 10‐ranked TFs in each cell fate transition system have been reported to enhance reprogramming efficiency or can be used for reprogramming in different combinations. TFs that have not been previously reported are often located in the same gene families with master TFs, suggesting functional compensation (Figure 4B, C; Figure S5, Supporting Information). Finally, several top‐ranked TFs potentially regulated more than 10% of the differentially expressed genes between source and target cells, indicating their importance in the reprogramming process (Figure 4D–F).
Prediction of master transcription factors in cell fate transitions. A) Flowchart of master transcription factor (TF) identification and ranking. When there are no corresponding cell‐state‐specific GRNs available, the GRN transfer generalization model is utilized to convert the transcriptome data of both source and target cells into cell‐type‐specific GRNs. Differential TFs were identified by filtering out GRNs between cell types and scored based on expression abundance, downstream node numbers, and the number of differentially covered genes in the downstream nodes. B) The model predicts the ranking of TF combinations known to be involved in MEF reprogramming into iPSCs. The predicted rank of each factor is shown in parentheses. The top 10 predicted TFs for MEF reprogramming into iPSCs by the model are shown, with red highlighting indicating TFs that have been used in fate conversion or validated to enhance reprogramming efficiency. C) The model predicts the ranking of TF combinations known to be involved in MEF transdifferentiation into cardiomyocytes. The predicted rank of each factor is shown in parentheses. The top 10 predicted TFs for MEF transdifferentiation into cardiomyocytes by the model are shown, with red highlighting indicating TFs that have been used in fate conversion or validated to enhance reprogramming efficiency. D‐F) Statistical information on the model's prediction of the top 10 master TFs regulating target genes in each cell fate transition system, as well as the percentage of differentially expressed target genes between the two cell states out of all differentially expressed genes.
Overall, we demonstrated the applicability of our strategy for searching master TFs in some reprogramming systems, which can provide a list of candidate master TFs that enhance cell fate reprogramming.
Simulating TF Perturbations During the Process of Round Spermatid Differentiation
2.5
In addition to predicting master TFs, GRNs can be utilized to predict the effects of TF perturbations during developmental processes. The effects of TF perturbations during differentiation can be simulated using the vector field of single‐cells. Therefore, we constructed a probabilistic graphical model (PGM) to simulate TF perturbations at the single‐cell level. First, the scRNA‐Seq data were subjected to cluster analysis, and the cells in each cluster were transformed into pseudo‐bulk data for the transfer of cluster‐specific GRNs. The resulting GRNs are integrated with scRNA‐Seq data to construct a PGM by learning conditional distribution parameters between genes considering the overall network structure. Based on this model, we predicted the changes of downstream TGs by setting the expression of TF to zero (Figure 5A). Finally, we estimated the effect of TF perturbations on cell developmental processes by adopting a strategy similar to CellOracle [24].
Simulation of the impact of transcription factor perturbation on cell differentiation progression. A) Flowchart of the construction process for the probability graphical model (PGM) used to simulate transcription factor perturbation. Single‐cell expression matrix was sampled to estimate the expression value distribution of each gene in the same cell cluster. The expression matrix of the same cell cluster was converted to pseudo‐bulk data and used to predict the GRN with regulatory weights through the generalized transfer model. The PGM of gene expression was constructed by combining GRN and gene expression values, and parameter learning was performed. The downstream target gene expression changes were inferred through the model by setting the TF expression value to zero. B) Diagram illustrating the process of round spermatid differentiation. RS2 represents steps 1‐2 spermatids, RS4 represents steps 3‐4 spermatids, RS6 represents steps 5‐6 spermatids, and RS8 represents steps 7‐8 spermatids. C) t‐SNE visualization of scRNA‐Seq data from different stages of round spermatid differentiation, with arrows indicating the direction of differentiation. D) Feature Plots depicting the relative expression level of the Crem gene during the process of round spermatid differentiation. E) Simulation of the impact of Crem gene knockout on the development of round spermatids by CellPolaris. Arrows pointing in the direction of differentiation indicate promotion, while arrows in the opposite direction indicate inhibition. F) Feature Plots depicting the relative expression level of the Hoxa4 gene during the process of round spermatid differentiation. G) Simulation of the impact of Hoxa4 gene knockout on the development of round spermatids by CellPolaris. Arrows pointing in the direction of differentiation indicate promotion, while arrows in the opposite direction indicate inhibition. H) List of top 10 TFs that influence the differentiation process in round spermatids after gene knockout simulation at each stage by CellPolaris.
By reanalyzing the scRNA‐Seq data of round spermatid differentiation [33], we constructed a differentiation trajectory (Figure 5B, C). We measured the effects of two transcription factors, Crem and Hoxa4, that had been reported to regulate round spermatid differentiation [34, 35, 36]. Consistent with the reported results, knocking out both TFs reversed the differentiation trajectory of round spermatids and inhibited their differentiation (Figure 5D–G). Our prediction results were consistent with the results of the CellOracle linear regression model (Figure S6A, B; Supporting Information). Subsequently, we identified TFs involved in the differentiation of round spermatids, most of which were important in all three stages (Figure 5H and Experimental Section). Of note, the absence of many of these genes results in early developmental arrest in animal models. Hence, our model is useful for guiding the study of gene functions in late‐stage developmental processes.
In summary, to simulate TF perturbations, we established a PGM by integrating transferred GRNs and single‐cell RNA‐Seq data. The performance of this model was evaluated by simulating TF perturbations in round spermatid differentiation. Predicted results were consistent with those of the genetic knockout animal models.
Simulating and Validating the Effects of Rfx2 Knocking Out on Round Spermatid Development
2.6
Next, we used CellPolaris to simulate the knockout of Rfx2, a gene known to be crucial for round spermatid development [37, 38]. The prediction results revealed that the knockout of Rfx2 resulted in the early blockage of round spermatid development, which is consistent with the predictions from CellOracle (Figure 6A, B; Figure S7A, B, Supporting Information). To comprehensively demonstrate the comparison between our predictions and those from CellOracle, we developed the mouse model with Rfx2 deficiency in the round spermatid development stage. In line with previous reports, knockout of Rfx2 resulted in abnormal spermatogenesis in mice (Figure 6C, D). To evaluate the developmental arrest period of spermatids, we used Peanut Agglutinin (PNA), a labeling technique that distinguishes between the various stages of sperm development. The arc length of PNA staining gradually increased as round spermatids advanced, providing a measure of their progression during development. Notably, the results showed that Rfx2 knockout led to a punctate pattern of PNA distribution, supporting the prediction that Rfx2 knockout causes a blockade in the early stages of round spermatid development (Figure 6E; Figure S7C–E, Supporting Information).
Knocking out Rfx2 leads to the developmental arrest of round spermatid. A) Feature Plots depicting the relative expression level of the Rfx2 gene during the process of round spermatid differentiation. B) Simulation of the impact of Rfx2 gene knockout on the development of round spermatids by CellPolaris. Arrows pointing in the direction of differentiation indicate promotion, while arrows in the opposite direction indicate inhibition. C) Comparison of testis size between wild‐type and Rfx2 knockout mice. Scale bar: 1 cm. D) Histological analysis of testis sections from Rfx2 knockout and wild‐type mice. Scale bar: 50 µm. E) Immunofluorescence staining in testis for Peanut agglutinin (PNA), a specific marker of acrosome. Scale bar: 50 µm. F) Volcano plot depicting differentially expressed genes in round spermatids resulting from Rfx2 knockout. G) The distribution of Rfx2 regulated target genes predicted by CellPolaris and CellOracle in the knockout dataset. Red color indicates consistent direction of predicted upregulation or downregulation compared to the actual data. H) The distribution of genes exhibiting a fold change of two or more in the knockout data overlaps with gene predictions from CellPolaris and CellOracle, respectively.
In a more detailed analysis, we examined the downstream TGs of Rfx2 predicted by CellPolaris during round spermatid development and identified 4 negatively and 26 positively TGs by Rfx2 (Figure S7B, Supporting Information). Consistent with these predictions, among the positively regulated TGs, seven were downregulated, with > 2‐fold changes in the Rfx2 knockout transcriptome data (hypergeometric test, p‐value = 0.0063, Experimental Section) (Figure 6F). Moreover, of the four negatively regulated genes, three were upregulated in the Rfx2 knockout transcriptome. Additionally, 65% (17/26) of the positively regulated genes displayed a downward trend in knockout samples. In comparison, when examining the TGs predicted by CellOracle for both positive and negative regulation, the accuracy of the predictions was lower than that of CellPolaris (Figure 6G). In particular, among the genes that showed a fold‐change of two or more upregulations in the Rfx2 knockout samples, CellOracle predicted six genes to be downregulated. Among the genes that exhibited a fold change of two or more downregulations in the Rfx2 knockout samples, CellOracle predicted six genes to be positively regulated by Rfx2 (Figure 6H).
Discussion
3
Deciphering the roles of TFs in developmental processes is a highly challenging task. In this study, we present CellPolaris, a unified framework designed to perform TF‐centered GRN construction, master TF identification, and TF perturbation simulation for developmental processes. For GRN construction, a transfer learning model that generates tissue or cell‐type‐specific GRNs from RNA‐Seq data by leveraging existing high‐confidence GRNs in training process. Compared with existing transcriptome‐only software such as GENIE3 and GRNBoost2 [9, 28], CellPolaris demonstrated a certain degree of performance improvement. Furthermore, compared with other tools such as PECA2, SCENIC+, and DeepTFni, CellPolaris does not rely on additional information beyond transcriptomic data for GRN inference once the model is trained [16, 18, 19, 24].
Several foundational models based on single‐cell data have been developed, including GeneCompass, scGPT, Geneformer, and scFoundation [39, 40, 41, 42, 43, 44, 45, 46]. The GRNs generated by CellPolaris can serve as prior biological knowledge for the pretraining process of these large models, for example, by imposing soft constraints. Meanwhile, one advantage of our framework is its scalability, which allows the high‐confidence GRNs training set to be expanded to enhance the performance of CellPolaris.
For the perturbation prediction task based on probabilistic graphical models, our approach, compared with CellOracle [24], learns parameters across the entire network rather than just parts of it. This method fully leverages the integrity of the network and uses the gene expression at single‐cell level. Recent studies, such as those using Spectra [47] and expiMap [48], have demonstrated the feasibility of perturbation prediction at the gene program level. Integrating these approaches with our model could potentially enhance the accuracy of predictions regarding changes in TG expression.
Future enhancements are also necessary for CellPolaris. While the transfer learning approach is effective for GRN construction, it could be strengthened by extending its applicability beyond human and mouse, particularly to distantly related species where cross‐species GRN transfer remains challenging. In addition to RNA‐Seq, other modalities such as ATAC‐Seq are becoming increasingly accessible. Therefore, supporting flexible input modalities for GRN transfer learning will be essential in the future. Currently, the module for predicting master TFs only ranks factors associated with cell fate transitions. Future updates could give combinations of TFs that more effectively influence these transitions. Moreover, the perturbation simulation module is constrained by the node size limits of the PGM, which reduces computational efficiency as the complexity of GRNs increases. Implementing parallel computing strategies may provide a viable solution to enhance performance in subsequent developments.
In conclusion, we developed CellPolaris, a framework designed to elucidate the roles of TFs in developmental processes using transcriptomic data. This framework offers valuable insights into the mechanisms underlying cell fate regulation and development, paving the way for future advancements in GRN analysis and related fields.
Experimental Section
4
Animal Models
4.1
The Rfx2 knockout mouse line was used as previously described [37]. All animal experiments were conducted in accordance with the guidelines of the Animal Care and Use Committee of the Center for Excellence in Molecular Cell Science (Shanghai Institute of Biochemistry and Cell Biology) at the Chinese Academy of Sciences.
Histology and Immunohistochemical Analysis
4.2
Testes were fixed in Bouin's solution or 4% paraformaldehyde (PFA), embedded in paraffin, and sectioned. For testicular histology, the sections were deparaffinized, rehydrated, and stained with hematoxylin and eosin (H&E). For immunohistochemical analysis, the sections were retrieved in 10 mM sodium citrate buffer (pH 6.0) by boiling in a microwave and further washed in PBS with 0.1% Triton X‐100. The sections were blocked with blocking buffer (10% donkey serum and 0.1% Triton X‐100 in PBS) and then incubated with the primary antibodies overnight at 4 °C followed by Alexa Fluor 488‐conjugated donkey secondary antibody (1:500, Jackson ImmunoResearch Laboratories) for 1 h at 25 °C. Fluorescent sections were mounted with ProLong Gold Antifade medium containing DAPI (Molecular Probes) and analyzed using fluorescence microscopy (Zeiss Axio). The following primary antibodies were used: rabbit anti‐RFX2 (1:100, PA5‐61850, Invitrogen) and rhodamine‐conjugated peanut agglutinin (1:1000, RL‐1072, Vector Laboratories).
Synchronous Spermatogenesis
4.3
Control and Rfx2 KO mice were used for synchronous spermatogenesis to generate Step1‐2 round spermatids. Spermatogenesis was synchronized as previously described, with some modifications [33]. Briefly, 2‐dpp mice were pipette fed 100 µg/g body weight WIN 18,446 (MP Biomedicals) and suspended in 1% gum tragacanth (aladdin) for 7 consecutive days. On day 8 of WIN 18,446 treatments, these animals received an i.p. injection of RA (25 µg/g body weight) (Sigma) in dimethyl sulfoxide (DMSO), and were then left to recover for sample collections at the Step1‐2 round spermatid stage. Testicular tissues were collected and analyzed for synchronous efficiency using histological analysis and PNA staining.
Isolation of Round Spermatids
4.4
Round spermatids were isolated from spermatogenesis‐synchronous control and Rfx2 KO mice, as described above. Cells were collected using Hoechst 33342/propidium iodide–based fluorescence‐activated cell sorting (FACS) as described previously [49]. Briefly, the testes were collected in PBS and placed on ice. After removal of the tunica albuginea, the testes were incubated in PBS containing 120 units/ml collagenase type I (Worthington) at 34 °C with gentle agitation for 5 min, digested with 0.25% trypsin (Gibco) and 5 mg/ml DNase I (Sigma), and terminated by adding fetal bovine serum (Gibco). The dissociated testicular cell suspension was collected and resuspended at a concentration of 1 × 10^6^ cells/ml in Dulbecco's modified DMEM medium (HyClone) containing 0.1 mg/ml DNase. The testicular cell suspension was then stained with Hoechst 33342 at a final concentration of 6 µg/ml for 30 min and stained with propidium iodide prior to sorting immediately. Round spermatids were sorted using flow cytometry (AriaSORP; BD Biosciences). PNA staining was used to analyze the purity of the cells.
RNA‐Sequencing and Data Processing
4.5
Total RNA was isolated from control and Rfx2 KO step1‐2 round spermatids using TRIzol reagent (Life Technologies, 15596‐018). The RNA integrity was assessed using an Agilent 2100 bioanalyzer. Total RNA (100 ng total RNA) was used for the library construction. rRNA was removed, and cDNA libraries were constructed using the TrueSeq Stranded Total RNA Library Prep Kit (Illumina) following the manufacturer's instructions. The libraries were sequenced on an Illumina HiSeq 2500 platform with 150 bp pair‐end reads. Raw RNA‐Seq reads were subjected to removal of adapter contamination and low‐quality sequences using trim_galore (version: 0.6.6) in the paired‐end mode; trimmed paired‐end reads shorter than 30 were discarded. Trimmed RNA‐Seq data were aligned to the mouse genome build mm10 using star (version: 2.7.11a) with default parameters [50], and mapped RNA‐Seq reads in SAM files were converted to BAM files using a home‐made script. Mapped reads on transcripts of mm10 Refseq annotation were counted using stringtie [51] under the parameters: ‘ ‐B ‐e ‐p 16’ with bam files and converted to count values from FPKM values using prepDE.py. Expression difference between groups was calculated using t‐test. Differentially expressed genes were identified under the parameters: Log2(Fold change) >= 1 or < = −1, p‐value <= 0.05. FPKM values of mm10 Refseq genes smaller than 0.5 were regarded as repressed genes. Volcano plots for differentially expressed genes between adjacent stages were generated with ggplot2 R package.
Single‐Cell RNA‐Seq Data Preprocessing
4.6
We collected single‐cell data from the literature and obtained raw expression matrices from the NCBI Gene Expression Omnibus (GEO), including GSE156125 [52], GSE212482 [53], and GSE107644 [33]. Scanpy was employed for the preprocessing and visualization of a count matrix of single‐cell data [54]. First, we applied scanpy.pp.normalize_total (adata) to linearly scale each cell based on the total UMI count to 10 000. Next, we utilized scanpy.pp.log1p (adata) to apply a log1p transformation to the expression values. Subsequently, we identified 2000 highly variable genes by employing scanpy.pp.highly_variable_genes (adata, n_top_genes = 2000). Further dimensionality reduction techniques, such as t‐SNE and UMAP, were employed, followed by clustering on the expression matrix for subsequent downstream analysis.
To generate pseudo‐bulk RNA‐Seq data from single‐cell RNA‐Seq data, the following procedure was used: Initially, the single‐cell expression count matrix was segmented into multiple submatrices based on the cell types identified through clustering. Subsequently, add the count of cells within the same cell type as pseudo‐bulk and calculate the Fragments Per Kilobase Million (FPKM) values at the pseudo‐bulk level, which can be obtained using Equation (1):
where ei∈R is the FPKM value of i‐th gene, count_ i _ is the count of i‐th gene at pseudo‐bulk level and *l_i_
- is the length of i‐th gene. FPKM is adopted as gene expression value.
PECA2 GRN Construction
4.7
PECA2 [16] method uses paired expression (RNA‐Seq) and chromatin accessibility data (ATAC‐Seq) as inputs to generate a gene regulatory network. We downloaded paired RNA‐Seq and ATAC‐Seq data from the Encyclopedia of DNA Elements (ENCODE) [55, 56] and generated 88 mouse PECA2 networks and 68 human PECA2 networks. Meanwhile, we generated 40 mouse and 14 human cell type‐level PECA2 networks using single‐cell multiomics data [57, 58, 59, 60, 61] by synthesizing cells of the same cell type using a pseudo‐bulk label. The output of PECA2 includes TF‐TG pairs with regulatory relationships and Trans‐Regulation Score (TRS), which measures regulatory intensity. These PECA2 networks constituted the network base to train our transfer learning model.
For experiments across cell types, we randomly partitioned PECA2 networks into four folds for cross‐validation, designating onefold as the test set. Similarly, for tissue and period evaluations, we applied a strategy similar to leave‐one‐out, using one tissue/state's data as the test set. In both cases, the rest is used for training and validation in an 8:2 ratio.
GRN Generator for New Cellular Contexts
4.8
To predict the regulatory relationship between TFs and TGs in new cellular contexts, we first proposed a GRN generator to infer the GRN utilizing gene expression data. The GRN generator adopted the neural collaborative filtering (NCF) method [62] to capture the complex regulatory relationship between TFs and TGs in a high‐order space. It first embeds each TF and TG as dense vectors, which are then fed into neural networks to learn high‐level nonlinear regulatory relationships. The regulatory scores prediction function is formulated as follows (Equation 2):
where y^ij∈R is the predicted regulatory score for i‐th TF and j‐th TG. xiTF,xjTG are embeddings for i‐th TF and j‐th TG, respectively. Each embedding consists of two components: the gene name embedding and the expression value, where xiTF=[zTFi|eTFi], xjTG=[zTGj|eTGj]. zTFi, zTGj∈Rd denote gene name embeddings, and eTFi, eTGj represent the FPKM values (Equation 1), [· | ·] indicates the concatenate function. fM is the NCF model [62], which maps TF and TG embeddings to their regulatory scores through the integration of a multi‐layer perceptron and matrix factorization [63, 64]. *f_c_
- is the regressor which output the regulatory score, and *f_e_
- is the feature interaction module. *f_e_
- mainly consists of two pathway: a matrix‐factorization (MF) pathway and a multi‐layer perceptron (MLP) pathway to extract complementary interaction features. The MF pathway captures the linear latent compatibility between TF and TG embeddings through an element‐wise product:
And a multi‐layer perceptron (MLP) pathway, which models high‐order, nonlinear TF–TG regulatory dependencies:
These two interaction features are fused by concatenation:
and passed to the regressor network *f_c_
- to generate the final regulatory score as Equation (2) formulated.
Critical Domain Training
4.9
To realize generalized GRN generation, the crux of our methodology lies in bridging the gap between different domains with the objective of expanding the diversity of the data space, particularly in the case of distant domains, which we refer to as the worst cases. We define the domain pairs with the maximum divergence as critical domain pairs. By enhancing these critical domains, the model emphasizes acquiring knowledge that is difficult to learn. This strategy led to a smooth and effective learning process. Both the robustness and generalizability of the model are enhanced. The domain similarity is calculated dynamically during training, leading to dynamic modifications of the critical domains.
We first calculated the distance (the larger the distance, the smaller the similarity) between training domain pairs by utilizing the Maximum Mean Discrepancy (MMD) [65], which is a widely used distance metric in transfer learning. The distance between two domains Ds1 and Ds2 is presented in Equation (6):
where D(·, ·) is the distance between two domains, N _ s1_ and N _ s2_ are cardinalities of these two domains. *f_h_
- maps the original instances to the Reproducing Kernel Hilbert Space H. After calculating the distances between all pairs of domains, we arranged them in descending order. When the distance of any pair of domains falls within the threshold of the top‐σ ratio, we consider them to be significantly apart. These pairs are referred to as critical‐domain pairs. σ is a pre‐defined hyperparameter.
To smoothly reduce the distribution discrepancy between the two domains, we proposed the use of an alternative data augmentation technique, Mixup [25], to smoothly blur the boundaries of the domains. Mixup is a simple but effective data‐augmentation method that performs linear interpolation on random pairs of samples. We extend Mixup to a GRN construction scenario. Specifically, given two samples (xiTF,xjTG,yij)∈Ds1, (xmTF,xnTG,ymn)∈Ds2, the augmented samples (xTF∼,xTG∼,y∼) could be calculated by the following Equation (7):
where λ ∼ Beta(α, α), α > 0 is a hyperparameter. The optimization objective is to minimize the loss of the original and augmented data generated between the critical domain pairs using Equation (8).
where the basic loss function *l_reg_
- is the MSE loss, and Dori and Dmix are the domains of the original and augmented data, respectively. N is the total number of source domains. The function max top means to select top‐σ domain pair (critical domain pairs) from all pairs.
By applying the Mixup method to the critical domain pairs, we can generate new samples that lie between the two domains based on information from the two domains. These samples integrate the common and specific information from both domains, filling the gap between the two domains. At the same time, samples from the original domain pairs also participate in model training, preserving the initial nature of cells for shaping identification. From a specific perspective, the knowledge being transferred is the regulatory patterns embedded in the GRNs, while the dynamic mixup process serves as the mechanism for extending these patterns across tissues/conditions with different distributions. Therefore, the model's generalization capability is enhanced.
Utilizing Cross‐Species Regulatory Relationships For Enhancing Generalization
4.10
We have developed an enhanced transfer learning strategy called “Extrapolative Mixup” (extra_mixup) that extends our original method for cross‐species scenarios. Since the differences in gene regulatory relationships across species are relatively large, direct and indiscriminate transfer may fail to capture tissue‐specific knowledge. However, the same tissues across different species (e.g., the human and mouse heart) exhibit clear, strong correlations. Inspired by this observation, we design a method capable of performing reasonable extrapolation within the distribution range of existing data. Take human heart GRN reconstruction as an example, we use mouse heart GRN as the strongly related network. We use the paired human kidney and mouse kidney as well as the strongly related mouse heart GRNs as the source domain. Here, paired indicates that these tissues/conditions are biologically and partially conserved gene regulatory mechanisms. This pairing ensures a higher degree of similarity in the underlying data distribution, thereby allowing us to leverage cross‐species commonalities in regulatory patterns. To establish cross‐species correspondence, we retrieved homologous genes between human and mouse from Ensembl BioMart and mapped them to a unified representation.
We design a two‐stage framework to enhance the cross‐species generalization for specific tissue (Figure S3A, Supporting Information). In stage‐1, we use the vanilla mixup to get an intermediate domain between the human kidney and mouse heart GRNs, whose data distribution lies between the source human kidney and mouse heart. In stage‐2, we extrapolate along the direction from the mouse kidney Ds1 to the intermediate domain Ds2, and further generalize to the human heart, this process can be formalized as follows.
Specifically, given two samples (xiTF,xjTG,yij)∈Ds1, (xmTF,xnTG,ymn)∈Ds2, the augmented samples (xTF∼,xTG∼,y∼) could be calculated by the following equation:
where λ ∼ Beta(α, α), λ¯=λ+1, α > 0 is a hyperparameter as defined in the vanilla mixup. The paired human–mouse kidney serves as a biologically grounded anchor, enabling us to construct an intermediate distribution that captures cross‐species regulatory similarities. Building upon this aligned domain, we then perform outward extrapolation to reach a more distant target, the human heart GRN. Within a biologically reasonable similarity range, increasing the amount of paired auxiliary knowledge allows the extrapolated distribution to more broadly and effectively cover the target specific distribution, namely the human heart (Figure S3B, Supporting Information).
Ground Truth TF‐TG regulations and GRN Benchmark
4.11
To verify whether the transfer learning network could accurately predict cell‐type‐specific GRNs, we conducted benchmark tests by comparing it with several known gene regulatory network inference methods, including PECA2 [16], CellOracle [24], GENIE3 [28], GRNBoost2 [8], SCENIC [9], SimiC [13], DeepGRNCS [11], and ICAnet [10]. Those are representative methods based on diverse models and algorithms.
The datasets used for the benchmark are listed below: GSE169262, GSE107644, GSE151309, and the PBMC dataset from the Seurat pipeline.
For the ground truth TF‐TG regulations, we utilized the ChIP‐Atlas database [29]. For each specific cell type or tissue, we downloaded the TF's ChIP‐Seq data in bed format from the ChIP‐Atlas datasets. We selected TF's binding peaks present in the ChIP‐Seq data. The corresponding TF‐TG scores were downloaded from ChIP‐Atlas and converted into a binary network, where each edge weight was either 1 or 0, indicating the presence or absence of TF's binding between genes. The conversion process involved the following steps: (1) selecting the top 100 genes based on the binding scores of MACS2 and STRING for each TF as positive edges in the binary network and (2) deleting all genes with non‐zero binding scores from the target gene list for each TF. This process results in a gene regulatory network specific to the cell type or tissue of interest. For our benchmark tests, we selected 6 regulatory factors across four cell types. The network inference performance was evaluated using the area under the curve (AUROC) metric.
We also used two additional TF‐TG regulation databases, RegNetwork [30] and TRUUST [31], for benchmarking. Initially, target gene lists for each transcription factor were downloaded from the two databases, and the downstream genes of the corresponding TF were identified in the GRNs generated by each method, to calculate the recall score of the target genes. Recall, also referred to as sensitivity or true positive rate, is a metric in machine learning used to assess a classification model's performance. The Recall Score here is defined as the ratio of predicted target genes of a specific TF to all target genes of that TF in the database.
Evaluation of the Extra_Mixup Method
4.12
We validated the extra_mixup method through multiple independent assessments. First, we examined how well each method recovers heart‐specific regulatory edges identified by PECA2 (Figure S3C, Supporting Information). Among 30,000 human heart‐specific edges (relative to other tissues), extra_mixup consistently ranked these tissue‐specific regulations higher than the baseline approach (the original mixup method of CellPolaris), demonstrating improved capture of tissue‐specific regulatory logic through cross‐species transfer. Second, we validated predicted regulations using independent ChIP‐Seq data from literature and ChIP‐Atlas (Figure S3D, Supporting Information). For key cardiac TFs, we examined how many of the top 100 predicted targets in our generated networks were confirmed by ChIP‐Seq binding. The extra_mixup approach showed higher validation rates compared to baseline, confirming that cross‐species knowledge transfer improves prediction accuracy.
Furthermore, we investigated how the number of source domains affects extra_mixup performance (Figure S3 E–H, Supporting Information). As adding more source domains introduces additional regulatory relationships, we analyzed whether true edges benefit more than false edges from this expanded knowledge base. Our results show that increasing source domains disproportionately improves scores for true regulatory relationships compared to spurious ones, indicating that the method effectively leverages conserved regulatory patterns while filtering noise. This demonstrates that CellPolaris can intelligently integrate knowledge from multiple divergent systems to improve target network reconstruction.
Prediction of Potential Master TFs
4.13
Assuming that there are two gene regulatory networks N 1 and N 2 (PECA2 network or transfer learning network), the master TFs of N 2 relative to N 1 are calculated as follows: (1) Use the generated PECA2 network (84 mouse networks or 76 human networks) as the background. If a TF‐TG regulatory relationship (edge) occurs more than ten times, the edge is considered as a common edge. (2) Retain edges that are unique to N 2 versus N 1 or have more than twice the TRS score of N 1 and delete the common edges in (1) as N 3. The top 5000 regulatory relationships in N 3 ranked by the TRS score as N 2's differential network. (3) Calculate four scores and then rank the TFs in the differential network: expression fold change in the source and target cells, the number of downstream target genes, weighted average fold changes of target genes by TRS score in the differential network, and eigenvector centrality of TFs, which represents the importance of its target genes in the GRN structure and is calculated as follows (Equation 10) using python package Networkx [66]:
The matrix A represents the adjacency matrix of the differential network, λ^1^ represents the largest eigenvalue of A, * x
- is the eigenvector corresponding to λ^1^, and the value of each position represents the eigenvector centrality of the corresponding node. (4) By averaging the four ranks, we obtained the integrated rank of the master TFs.
PGM Method Overview
4.14
We propose a probabilistic graphical model (PGM) to further combine our GRN transfer learning procedure with single‐cell RNA‐Seq data (X) for specific contexts of interest. The output a set of weighted GRNs, which are the GRNs corresponding to clusters from single‐cell data X and used to simulate TF's knockout effect at single‐cell level. The PGM method consists of the following steps: (1) Construct GRNs (G) using pseudo‐bulk RNA‐Seq data for the cell clusters obtained from scRNA‐Seq data X at the cell type level. G provides a set of TF‐TG regulatory relationships and also TRS score as their regulatory strength. (2) Calculate each cell's neighbors from scRNA‐Seq data X and merge them into metacells. This will provide the high dimensional but not‐sparse expression vector of TFs and TGs. (3) Construct a set of refined and weighted GRNs (G’) by PGM to obtain the conditional distribution between the variables. (4) Perform in silico TF knockout predictions in scRNA‐Seq data X to compute the direction of cell fate changes.
Metacell Conversion
4.15
We assume scRNA‐Seq data X of the specific cellular contexts of interest are with N cells. We introduce the notion of metacells for inferring GRNs in PGM. A metacell is, in theory, an imputed scRNA‐eq cell profile that are statistically equivalent to samples derived from the same RNA pool. Metacell profiles should therefore be distributed multinomially with predictable variance per gene and near zero gene‐gene covariance. In practice we used the top 10 dimensions of principal components (PCs) to calculate the distance among cells. For each cell, we combined N10∼N5 closest cells together to merge into a metacell. This way we converted scRNA‐Seq data X to expression matrix at the metacell level as X′, which allow us to assume all gene expression FPKM values to follow a joint Gaussian distribution at the resolution of the metacell.
Integrate G and X’ by PGM
4.16
Network preprocessing and node grouping. Without loss of generality, we assumed the GRN to be a connected graph. We first examined the graph for the presence of directed cycles. If such cycles exist, we proceed to remove the edge with the smallest TRS value from the cycle. According to the TF‐TG regulatory structure in our GRN, we grouped the nodes into two sets: (1) top TFs, whose indegree is 0, recorded as S 1; (2) TGs (contain regulated TFs), recorded as S 2.
Multivariate Gaussian distribution of genes. Assume that the expression of gene_ i _ can be recorded as a random variable *y_i_
- and all the genes follows joint Gaussian distribution Y∼N(m,Σ) at the metacell level, where Y = (y 1,y 2,..., *y_n_ *)^ T ^, n is the number of genes, * m *, Σ are learnable parameters of mean and covariance matrix. To simplify computations and reduce the number of parameters, we set the covariance between two nodes in the network that are not connected by a directed edge to zero.
Conditional distribution for genes in * S * 2. For the gene in S 2, we model *TG_j_ *'s conditional mean uj|Paj and variance σj|Paj2 using the properties of joint Gaussian distribution as follows, respectively (Equations 11 and 12),
where *Pa_j_
- represents the set of TFs who regulate *TG_j_ *, *w_ij_
- is a linear coefficient reflecting the strength of the relationship between *y_i_
- and *y_j_ *. Σj,Paj,ΣPaj,Paj,ΣPaj,j are the submatrices of Σ corresponding to the subscripts.
Likelihood function. Let {θ} denote the set of parameters which consists of conditional mean parameter {*w_ij_ }, the mean ( m *) and covariance matrix (Σ) of joint Gaussian distribution. The purpose of constructing these conditional probabilities is to maximize the probability of generating X′ from the network, thus the likelihood function is defined as follows:
After distinguish between S 1 and S 2 through different probability forms, the objective function can be written as:
where P(*TF_i_ *) can be deduced by fact that the expression of *TF_i_
- follows the normal distribution N(*m_i_ *, Σ_ ii ) and P(*TG_j *|*Pa_j_ *) is from the expression of *TG_j_
- given *Pa_j_
- follows the normal distribution N(uj|Paj,σj|Paj2).
Combining further information in G and X’. In constructing likelihood function, we used the GRN structure in G as a set of TF‐TG regulatory relationships. We further integrate the regulatory strength TRS score. The TRS score contains the expression information of TF and TG, and the regulation information at the epigenetics level in PECA2 method. The higher the TRS score, the stronger the influence of a TF on TG expression for the TF‐TG pair, implying a larger parameter *w_ij_ *. Meanwhile, the sample covariance matrix (S) can be calculated from scRNA‐Seq data X’ to constrain the parameter estimation of Σ. To combine those additional information in G and X’ in estimating {θ}, we used Bayesian formula to combine the likelihood function P(X′|G, θ) and constraints on parameters P(θ) as follows:
In practice, we incorporate TRS scores and S to restrict {*w_ij_ *} and Σ and propose h(θ) to substitute logP(θ|G) as follows,
where *E_G_
- is the edge set in G. By default, we set ε to 10^−10^.
The final objective function is
Parameter estimation. We constrain the range of the parameter *m_i_
- to [0.5misample,1.5misample], and σi2 to [0.5σi2,sample,1.5σi2,sample], where misample and σi2,sample are the calculated mean and variance of *gene_i_
- from X . We use stochastic gradient descent (SGD) to minimize the objective function, and the default number of iterations is 100.
In Silico KO Prediction
4.17
Next, we use gene‐regulatory networks inferred PGM by combining GRNs (G) with single‐cell RNA‐Seq data (X) to perform in silico transcription factor perturbations, simulating the consequent changes in cell fate using only unperturbed wild‐type data. At single‐cell cluster pseudo‐bulk level, we input a raw RNA expression vector and output a predicted expression vector after knocking out specific TF. Suppose knocking out *TF_i_ *, to highlight the relationship between *TF_i_
- and *TG_j_
- (only consider *w_ij_
- not 0), we computed the marginal conditional distribution of *TG_j_
- given *TF_i_ *. *TG_j_ *′s expression value can be estimated as follows:
which is the conditional mean of P(*TG_j_ *|*TF_i_
- = 0).
We further extend the pseudo‐bulk level estimation to single‐cell level by assigning each cell the estimated values from its belonging cluster. After traversing all the downstream of *TF_i_ *, we obtained the first‐round predicted gene expression values. *TF_i_
- and its downstream TGs were changed and recorded as set V 1. We can then make predictions about *TF_i_ *’s second‐order neighbors. For a TG, which is *TF_i_ *’s second‐order neighbor, we use the changed value for TFs in V 1 to get the TG's predicted value. For the cell fate conversion task, we calculated the expression of each gene after knockout. The original expression vector was recorded as e 1, the predicted expression vector as e 2, and the other cell type (c) expression vector was recorded as *e_c_ *. Measured by cosine similarity, the cell state shift after knocking out a TF is determined by Equation (19):
The higher the cosine value, the more likely the cell state will transition to cell type c following a TF knockout. The signal propagation within the GRN, estimation of transition probabilities, and analysis of simulated transition in cell identity follows the procedure in CellOracle.
Statistical Testing of the Accuracy of TGs Expression Changes
4.18
During the development of round spermatids, 26 genes identified by the algorithm as positively regulated by Rfx2 were experimentally found to be downregulated after Rfx2 knockout. Additionally, the 7 genes labeled as ′verified′ were downregulated with a more than twofold change following Rfx2 knockout, and these changes in expression are consistent with the results calculated by our method.
Next, we use hypergeometric tests to demonstrate that the 7 out of 26 genes identified are significant. In the knockout experiment, there were a total of 5741 genes with reduced expression, of which 512 genes decreased with more than twofold. We use phyper function in R to do this test, phyper (7‐1, 512, 5741‐512, 26, lower.tail = F).
To compare with CellOracle, we randomly sampled 80% of the cells 30 times and calculated the results using both methods separately. We do t‐test to demonstrate that the proportion obtained by our method is significantly higher than that of CellOracle.
Simulate and Draw the Cell Differentiation Trajectory After Knocking Out a Specific TF
4.19
To simulate cell identity, we developed our code by modifying the CellOracle [24]. Using the PGM, we obtained the change in gene expression of TGs after knocking out the TF. We calculated the differences in gene expression between neighboring cells. By calculating the correlation coefficient of these two vectors, we can convert the probability of a given cell to that of its neighboring cells. Different from CellOracle, we considered only the changes in TGs that were directly connected to the TF when calculating the correlation.
Author Contributions
The detailed contributions are as follows: Guihai Feng designed the overall architecture and specific implementation details of the model. Xin Qin and Wuliang Huang developed the transfer learning framework and implemented its code. Jiahao Zhang initially designed the architecture and code for the downstream tasks. Wentao Cui contributed to the design of these downstream tasks and optimized the code. Yiyang Zhang managed the visualization of TF perturbation results and further optimized the code. Yao Chen conducted experiments with actual knockout models, and Shirui Li was responsible for benchmarking the performance of the model.
Xin Li, Yong Wang, Yiqiang Chen, Shihua Zhang, Yuanchun Zhou, Ming‐Han Tong, Xuezhi Wang, Hongmei Wang and Ge Yang supervised the project. Xin Li, Yong Wang, Yiqiang Chen, Shihua Zhang, Yuanchun Zhou, Ming‐Han Tong and Guihai Feng conceived and designed the study. Guihai Feng, Xin Qin, Jiahao Zhang, Wuliang Huang, Yiyang Zhang, Wentao Cui, Yao Chen and Shirui Li designed the algorithm and performed the experiments. Wenhao Liu, Yao Tian, Yana Liu, Jingxi Dong and Ping Xu were involved in analyzing sequencing data. Zhenpeng Man, Guole Liu, Zhongming Liang, Xiaodong Yang, Xinlong Jiang and Pengfei Wang provided assistance in designing the model. Guihai Feng, Xin Qin, Jiahao Zhang, Yiyang Zhang, Wentao Cui, Wuliang Huang and Shirui Li wrote the manuscript. The X‐COMPASS Project Consortium members collaborated in the paper discussions. All authors read and approved the final version of the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
Code Availability
All source codes and sample data of CellPolaris are publicly available on GitHub (https://github.com/xCompass‐AI/CellPolaris).
Supporting information
Supporting File: advs73350‐sup‐0001‐SuppMat.docx.
Supplemental Table: advs73350‐sup‐0002‐Table_S1.xlsx.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1I. S. Peter and E. H. Davidson , “Evolution of Gene Regulatory Networks Controlling Body Plan Development,” Cell 144 (2011): 970–985.21414487 10.1016/j.cell.2011.02.017PMC 3076009 · doi ↗ · pubmed ↗
- 2J. P. Thiery and J. P. Sleeman , “Complex networks orchestrate epithelial–mesenchymal transitions,” Nature Reviews Molecular Cell Biology 7 (2006): 131–142.16493418 10.1038/nrm 1835 · doi ↗ · pubmed ↗
- 3W. de Laat and D. Duboule , “Topology of mammalian developmental enhancers and their regulatory landscapes,” Nature 502 (2013): 499–506.24153303 10.1038/nature 12753 · doi ↗ · pubmed ↗
- 4N. Perrimon , C. Pitsouli , and B. Z. Shilo , “Signaling Mechanisms Controlling Cell Fate and Embryonic Patterning,” Cold Spring Harbor Perspectives in Biology 4 (2012): a 005975.22855721 10.1101/cshperspect.a 005975 PMC 3405863 · doi ↗ · pubmed ↗
- 5S. A. Lambert , A. Jolma , L. F. Campitelli , et al., “The Human Transcription Factors,” Cell 175 (2018): 598–599.30290144 10.1016/j.cell.2018.09.045 · doi ↗ · pubmed ↗
- 6S. Kim and J. Shendure , “Mechanisms of Interplay Between Transcription Factors and the 3D Genome,” Molecular Cell 76 (2019): 306–319.31521504 10.1016/j.molcel.2019.08.010 · doi ↗ · pubmed ↗
- 7W. A. Whyte , D. A. Orlando , D. Hnisz , et al., “Master Transcription Factors and Mediator Establish Super‐Enhancers at Key Cell Identity Genes,” Cell 153 (2013): 307–319.23582322 10.1016/j.cell.2013.03.035PMC 3653129 · doi ↗ · pubmed ↗
- 8T. Moerman , S. Aibar Santos , C. Bravo Gonzalez‐Blas , et al., “GRN Boost 2 and Arboreto: efficient and scalable inference of gene regulatory networks,” Bioinformatics 35 (2019): 2159–2161.30445495 10.1093/bioinformatics/bty 916 · doi ↗ · pubmed ↗
