Pooled single-cell screen in colorectal cancer defines transcriptional modules linked to oncogenes

Viola Hollek; Francisca Böhning; Catalina Florez Vargas; Anja Sieber; Markus Morkel; Nils Blüthgen

PMC · DOI:10.1038/s44320-025-00186-2·January 19, 2026

Pooled single-cell screen in colorectal cancer defines transcriptional modules linked to oncogenes

Viola Hollek, Francisca Böhning, Catalina Florez Vargas, Anja Sieber, Markus Morkel, Nils Blüthgen

PDF

Open Access

TL;DR

This study uses single-cell screening to identify transcriptional modules linked to oncogenes in colorectal cancer, which can help predict patient outcomes and improve classification systems.

Contribution

The study introduces a novel method to link oncogenic signaling to transcriptional states and clinical outcomes using a pooled single-cell transcriptomic screen.

Findings

01

Ten conserved oncogene-driven transcriptional modules were identified, representing core cancer phenotypes.

02

High activity in certain modules is associated with worse or better prognosis in CRC patient cohorts.

03

Transcriptional modules improve existing CRC classification systems like CMS.

Abstract

Oncogenic mutations shape colorectal cancer (CRC) biology, yet their impact on transcriptional phenotypes remains incompletely understood, and their individual prognostic value is limited. Here, we perform a pooled single-cell transcriptomic screen of over 100,000 CRC cells with a comprehensive barcoded library of oncogenic variants across genetically diverse CRC lines. Using a variational autoencoder-based interpretable factor model, we identify ten conserved oncogene-driven transcriptional modules (TMOs) representing core cancer phenotypes such as cellular plasticity, inflammatory response, replicative stress, and epithelial-to-mesenchymal transition. Engagement of these modules can be context-dependent, reflecting interactions between oncogene-induced driver pathways and background genetics. TMO activity in patient tumors stratifies CRC cohorts into high- and low-risk groups,…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases4

colorectal cancer CRC inflammatory cancer

Figures13

Click any figure to enlarge with its caption.

Selection of oncogenic variants and model systems for gain-of-function screening in CRC cell lines.(A) Workflow to select oncogenic variants for constructing the oncogene library. (B) Frequency of the selected oncogenic variants in patients across six chosen cancer studies. (C) Schematic representation of the CRC oncogenic signaling network, with colored nodes indicating oncogenic variants used in the screen. (D) Mutational background of the cell line models selected for the screen. (E) Baseline PROGENy pathway activities of the selected unperturbed cell lines across relevant pathways, as show

A), and repeated application of this procedure with *k* = 10 and random seeds confirmed the robustness of the modules (Appendix Fig. 2B).Figure 3Identification and functional characterization of modules spanning the colorectal cancer phenotypic space.(A) Scheme of the conditional variational autoencoder (scVI) with linear decoder to learn gene expression modules, k denotes the number of modules. (B) Selection of the appropriate k by the number of robust modules. The heatmap depicts the number of robust modules determined using Jaccard similarity between all modules of *k* = 2 to *k* = 15. *k*

A).Figure 5Oncogenic MAPK variants drive heterogeneous phenotypic programs.(A) Heatmap of average TMO 1-3 scores induced by oncogeniv variants, grouped by pathway association. (B) Heatmap of TMO 1-3 module activities induced by MAPK pathway oncogenes or the tdTomato control, summarized by line and oncogenic variant. (C) Average TMO 1 (Inflammatory Response) activity versus YAP signaling scores using the Gregorieff EGFR-YAP (top) or Hallmark TNFa via NFkB (bottom) signature scores per cell line and oncogene. For color code, see Fig. [4](#Fig4). (D) Protein levels of YAP1 (top) and phospho-prote

Various oncogenes unlock specific phenotypic modules.(A) Distribution of cells (left) and averaged position per oncogene (right) in the UMAP of the latent space (see Fig. [3F](#Fig3)), colored by oncogene. (B) Distribution of cells (left) and averaged position (right), colored by the highest-ranking TMO. (C) Heatmap of average TMO activities in oncogene-overexpressing cells, ordered by line and transgene. Transgenes are ordered by pathway association. Color code for pathways as in Fig. [2E](#Fig2).

A,B). To test whether the TMOs merely replicate these subtypes or provide additional clinically relevant phenotypic characteristics, we performed survival analysis adjusted for previous subtypes. This analysis showed that even when considering CMS subtypes (Guinney et al, [2015](#CR30)) (Fig. [7A](#Fig7)) or molecular subtypes (Marisa et al, [2013](#CR62)) (Appendix Fig. 7C), the prognostic value remained significant, indicating that the modules encode clinically relevant phenotypes not covered by previous classifiers. Within CMS subgroups, stratification by TMOs provided additional informatio

A). Six out of the ten modules showed significant prognostic value using a proportional hazards regression model (Fig. [6E](#Fig6)): Higher scores for TMO 3 (Replicative Stress) and TMOs 4-5 (Cell Cycle Progression) were associated with lower hazard (HR: 0.35, 0.42, 0.42, respectively), whereas higher scores for TMO 1, 2 and 10 (Inflammatory Response, Plasticity/Resistance and EMT, respectively) were associated with higher hazard (HR: 3.5, 3.4, 4.8, respectively). These associations remained significant after adjusting for clinical parameters such as MSI status or AJCC stage (Fig. [EV3](#Fig10

Funding4

—http://dx.doi.org/10.13039/501100001659Deutsche Forschungsgemeinschaft (DFG)
—Bundesministerium für Forschung, Technologie und Raumfahrt
—http://dx.doi.org/10.13039/501100012353Deutschen Konsortium für Translationale Krebsforschung (DKTK)
—Federal Ministry of Research, Technology and Space (BMFTR)

Keywords

Colorectal CancerTranscriptional ModulesSingle Cell ScreeningOncogenesSignaturesCancerComputational Biology

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSingle-cell and spatial transcriptomics · Ferroptosis and cancer prognosis · Cancer Genomics and Diagnostics

Full text

Introduction

Colorectal cancer (CRC) progresses via stepwise acquisition of cancer driver mutations that interact to establish complex cancer phenotypes. Two stereotypic progression trajectories of CRC have been described: on the one hand, the conventional adenoma-carcinoma pathway is defined by functional loss of APC, and mutations in KRAS and TP53 (Fearon et al, 1990). On the other hand, the serrated pathway is often driven by oncogenic mutations in BRAF or KRAS, and functional loss of MLH1 (Jass, 2007). However, the frequencies of individual mutations are generally low. For instance, the most common KRAS mutation, KRAS (G12D), is found in ~10% of CRCs (The Cancer Genome Atlas Network, 2012) and, collectively, KRAS is mutated in about 40% of patients. Detailed mechanistic studies showed common functions, but also functional differences of individual KRAS mutations (Johnson et al, 2022), as also inferred from their non-equivalent distribution in CRC patients of different age, sex, or subsite (Serebriiskii et al, 2019). Ultimately, cancers are each characterized by their own mutational landscape, regardless of their shared progression via the conventional or serrated pathways.

Theories of tumor evolution provide explanatory models on how individual cancers acquire unique sets of cancer drivers as subsequent mutations interact with preexisting mutational landscapes to increase cancer cell fitness (Laplane and Maley, 2024). Cancer driver effects on signaling networks are specific to tissue, as shown in mouse models (Young and Jacks, 2010), cell lines (Yi et al, 2024), and clinical studies (Schneider et al, 2017; Cook et al, 2021). But even within the same tissue, individual oncoproteins such as KRAS exert different effects depending on the stage (Najumudeen et al, 2024) and the mutational context (Sakai et al, 2018). Therefore, it is challenging to assign specific cancer cell phenotypes and associated transcriptional programs to individual oncogenes. It is not surprising that individual mutations have no or low prognostic value for patient outcomes (Ottaiano et al, 2020; Imamura et al, 2012; Jones et al, 2017; Lavacchi et al, 2022; Nunes et al, 2024).

Cluster-based classification using cancer transcriptomes of patient cohorts showed higher promise for prognosis. Consensus molecular subtypes (CMS) have been defined that stratify CRCs along the progression trajectories, considering either whole-cancer tissue transcriptomes (Guinney et al, 2015) or single-cell cancer transcriptome information (Joanito et al, 2022). These signatures are linked to transcriptional programs that endow cancer cells with multiple functional traits (Chen et al, 2021; Khaliq et al, 2022). CMS classification is associated with clinically relevant features such as cancer localization, microsatellite stability (MSS) versus instability (MSI), and shows moderate prognostic value (Guinney et al, 2015; Joanito et al, 2022).

It is currently unknown how these transcriptional phenotypes are established by oncogenes. Particularly, it is unclear if specific transcriptional responses are regulated by multiple oncogenes, how oncogenic responses depend on the context, and which mutations within the same pathway lead to divergent outcomes. Finally, it needs to be established which transcriptional responses are clinically relevant and allow for better stratification of patients.

Therefore, we set out to link oncogenic drivers, transcriptional modules, and functional traits in a systematic manner by employing recent advances in pooled screens with single-cell read-out (Oren et al, 2021; Ursu et al, 2022). We assembled a comprehensive barcoded library of frequent and rare oncogenic drivers to assess their transcriptomic effects in a panel of CRC cell lines with diverse genetic backgrounds. Cataloging transcriptomes of more than 100K cancer driver-transgenic CRC cells, we find that oncogene-induced gene expression changes are diverse and can be background dependent. We use machine learning to define a limited number of transcriptional modules that are linked to well-defined functional traits of CRC. We show the clinical utility of the transcriptome models by stratification of CRC cohorts into low- and high-risk groups. In summary, our work connects driver oncogenes, their transcriptomic and functional effects, and their clinical relevance as a resource to study CRC progression from bench to bedside.

Results

Pooled single-cell screening for oncogene activities in CRC

To study oncogene functions in CRC, we generated a comprehensive transgene library of known or potential oncogenic drivers. For this, mutational data from six non-redundant CRC studies (Seshagiri et al, 2012; Hoadley et al, 2018; Ellrott et al, 2018; Taylor et al, 2018; Liu et al, 2018; Brannon et al, 2014; Guda et al, 2015; Giannakis et al, 2016; Vasaikar et al, 2019) were retrieved via cBioPortal (Cerami et al, 2012; Gao et al, 2013; de Bruijn et al, 2023), and potential gain-of-function mutations were filtered by their frequency of occurrence, using a fixed set of rules (Fig. 1A). Additionally, we included well-characterized dominant-negative driver mutations in TP53 (Boettcher et al, 2019), SMAD2 and SMAD4 (Woodford-Richens et al, 2001; Fleming et al, 2013) (Fig. 1B). In total, we generated a library comprising 50 transgenes, encompassing mutant and wild-type variants of key drivers of the CRC oncogenic signaling network (Fig. 1C).Figure 1. Selection of oncogenic variants and model systems for gain-of-function screening in CRC cell lines.(A) Workflow to select oncogenic variants for constructing the oncogene library. (B) Frequency of the selected oncogenic variants in patients across six chosen cancer studies. (C) Schematic representation of the CRC oncogenic signaling network, with colored nodes indicating oncogenic variants used in the screen. (D) Mutational background of the cell line models selected for the screen. (E) Baseline PROGENy pathway activities of the selected unperturbed cell lines across relevant pathways, as shown in (C). Source data are available online for this figure.

We chose five cell lines to capture genomic diversity and distinct mutational landscapes of CRC. Caco-2 cells lack mutations in the MAPK pathway, whereas HT-29 and RKO harbor the BRAF^V600E^ variant. In contrast, HCT116 and SW480 carry KRAS^G13D^ and KRAS^G12V^ mutations, respectively. RKO and HCT116 retain wild-type APC, whereas Caco-2, HT-29, and SW480 possess truncating APC mutations. CTNNB1 is mutated in both Caco-2 and HCT116, while TP53 alterations are found in Caco-2, HT-29, and SW480 (Fig. 1D). These cell lines also represent the spectrum of MSS (Caco-2, HT-29, SW480) and MSI (RKO, HCT116) (Ahmed et al, 2013). Consistent with their KRAS- or BRAF-mutant backgrounds, RKO, HCT116, and SW480 exhibited high activity scores of the MAPK and EGFR pathways, as assessed by PROGENy transcriptome analysis (Schubert et al, 2018). Wnt pathway activity was higher in APC-mutant cell lines Caco-2, HT-29, and SW480, and TP53 pathway activity was most pronounced in TP53-wild-type HCT116 cells, with moderate elevation in HT-29 and RKO (Fig. 1E).

We transduced the barcoded oncogene library (Fig. 2A) into the CRC cell lines at low multiplicity-of-infection, and 4 days post-transfection enriched GFP-positive cells via fluorescent-activated cell sorting (FACS) (Fig. 2B). Single-cell transcriptome analysis resulted in a dataset of more than 100,000 cells in which all oncogenic perturbations were well represented across both replicates, including the neutral control transgene tdTomato (Fig. 2C, see Fig. EV1A for quality metrics). We detected transgene overexpression for all constructs (Fig. EV1B).Figure 2. Establishment of the pooled oncogene screen and initial data analysis.(A) Design of the transfer plasmid containing the oncogene coding sequence flanked by Gateway cloning sites with a unique 10 bp nucleotide barcode in the 3´-UTR, and a GFP selection marker. (B) Experimental workflow of the screen: production of ecotropic lentiviral oncogene library, infection of cell lines with the pooled oncogene library, fluorescence-activated cell sorting of GFP-expressing cells, single-cell analysis using separate libraries for barcodes and transcriptomes. (C) Coverage of the oncogenic variants in the screen. Data are cell counts ± SD across the five cell lines. Transgene conditions exceeding 150 cells per replicate and cell line were downsampled, resulting in max. 300 cells for both replicates. (D) UMAPs of single-cell transcriptome data per cell line and after integration of replicates, colored by oncogene. Color code as in (C). (E) Cell line-specific PROGENy pathway activities induced by transgenes. Conditions are labeled by pathway association of the transgenes (see also Fig. 1C). (F) Global transcriptome changes caused by the oncogenes compared to the tdTomato control as quantified by E-Distance. Heatmap shows E-Distance per cell line and replicate. Boxplot shows E-Distance across samples. Box and line show median, Q1 and Q3 values and whiskers extending to 1.5× interquartile range. Rows containing oncogenes are clustered by E-Distance.

Projecting the single-cell data of integrated replicates into UMAP space revealed distinct clustering patterns of oncogenes (Fig. 2D): MAPK-associated oncogene variants (Fig. 2D, blue colors) formed well-defined clusters in all cell lines except for RKO. Furthermore, in all cell lines with TP53 mutations, Caco-2, HT-29, and SW480, the introduction of intact TP53 led to the emergence of distinct cell clusters (Fig. 2D, light red color). Finally, oncogenic CTNNB1 variants in Caco-2 and AKT variants in RKO cells (Fig. 2D, light and dark purple colors, respectively) cluster apart from other variants, suggesting distinct cellular responses.

Applying PROGENy signaling analysis (Schubert et al, 2018) to our dataset revealed key pathways affected across the cell lines and variants, with MAPK, EGFR, NFκB, TGFβ and TNFα among the modulated pathways (Fig. 2E). By quantifying the effect of transgene expression on global transcriptome deregulation using E-distance (Peidli et al, 2024) (Fig. 2F), we noticed that genes known to be involved in MAPK signaling, such as BRAF, MAP2K1, and KRAS mutations, as well as TP53 wild-type and CCND1^T286I^, ranked highest. In contrast, most wild-type proto-oncogenes were positioned in the lower half of the plot, suggesting minimal perturbation. This effect was independent of transgene levels (Appendix Fig. 1).

Identification and functional characterization of modules spanning the phenotypic space

To decompose the transcriptional heterogeneity induced by oncogenes across the cell lines, we used an unsupervised factor model to identify recurrent transcriptional modules. To learn transcriptional programs present across cell lines, we employed a conditional variational autoencoder with a linear decoder (Fig. 3A), as implemented in scVI (Svensson et al, 2020; Lopez et al, 2018). We used “cell line” and “replicate” as batch variables prior to module definition which removes average transcriptome differences between cell lines or replicates. The number of modules is dependent on the size of the latent space (k). We determined the k by computing the Jaccard similarity for k = 2 to k = 15 between all modules and selected the optimal k = 10 (Fig. 3B). Correlation analysis of the gene weights of the decoder confirmed that the 10 modules were non-redundant (Fig. 3C). Thus, the analysis defined ten cell line-overarching transcriptional modules unlocked by oncogenes (TMOs). TMOs are represented as weighted gene signatures that result from the linear decoder. Most modules could also be retrieved when applying the approach to a single-cell line (Appendix Fig. 2A), and repeated application of this procedure with k = 10 and random seeds confirmed the robustness of the modules (Appendix Fig. 2B).Figure 3. Identification and functional characterization of modules spanning the colorectal cancer phenotypic space.(A) Scheme of the conditional variational autoencoder (scVI) with linear decoder to learn gene expression modules, k denotes the number of modules. (B) Selection of the appropriate k by the number of robust modules. The heatmap depicts the number of robust modules determined using Jaccard similarity between all modules of k = 2 to k = 15. k = 10, as the lowest rank with the highest number of robust modules (eight), was chosen for further analysis. (C) Pearson correlation of the gene weights for the transcriptional modules (TMO 1-10). (D) PROGENy pathway activity captured by the TMOs. PROGENy pathway scores were calculated from the gene weights of the transcriptional modules. (E) Gene Set Enrichment Analysis of selected MSigDB hallmark, KEGG, Reactome, and literature-curated pathways using gene weights per module. NES Normalized Enrichment Score. See Fig. EV2B for the complete set of signatures (F) UMAP of the latent space of the conditional variational autoencoder with linear decoder, colored by cell line. (G) UMAPs of the latent space of the linear autoencoder, colored and split by module weights or colored by cell cycle phases (bottom right). (H) Definitions of TMOs, as inferred from analyses in (D–G).

To functionally characterize the ten TMOs, we performed correlation analyses with known functional gene expression signatures, using PROGENy pathway activities and Gene Set Enrichment Analysis (GSEA) (Subramanian et al, 2005) (Figs. 3D,E and EV2A,B). In addition, we confirmed the integration of cell lines and replicates (Figs. 3F and EV2C) and assessed cell cycle states (Figs. 3G and EV2D).

PROGENy analysis showed that TMOs 1-3were associated with high EGFR/MAPK activity across all cell lines (Fig. 3C). However, GSEA suggested that these TMOs reflected distinct consequences of MAPK signaling: TMO 1 best represented an inflammatory response mediated via YAP, TNFɑ/NFκB and elevated Wnt signaling (Fig. 3E) (Schwitalla et al, 2013), with chemokines (CXCL3, CCL20) (Zhou et al, 2023a) and cytokines (IL24, IL23A) among its top genes. TMO 2 scored high for published signatures associated with relapse risk (Fig. 3E), was overrepresented in cell in G1 cell cycle phase (Fig. 3G) and was defined by genes associated with epithelial plasticity (CRCT1, FAM25A, ITGAX) (Kypriotou et al, 2012; Meng et al, 2018a; Hiratsuka et al, 2020; Ohki et al, 2022; Yu et al, 2022b) and therapy resistance (LAMA3, LCN2, EMP1) (Zhu et al, 2020; Li et al, 2025; Chaudhary et al, 2021; Cañellas-Socias et al, 2022). We interpreted it as an epithelial cell state marked by low proliferative activity, adaptability, and survival signaling. TMO 3 exhibited enrichment of DNA damage signatures (Fig. 3E), displayed strong weights of DNA repair genes BARD1, BRCA1, or POLQ (Lange et al, 2011) and was prevalent in cells classified as S-phase (Figs. 3G and EV2D), suggesting that this module captured replicative stress processes.

TMO 4 and 5 exhibited high weights in S- and G2M-phase cells (Fig. 3G), were enriched for cell cycle-related gene sets (Fig. 3E), and activity was restricted to specific cell cycle phases (Fig. EV2D). These modules were therefore defined as “cell cycle progression S” and “cell cycle progression G2M”, respectively.

TMO 6 was enriched for terms associated with differentiated intestinal cell types and colon differentiation (Fig. 3E). Among its top genes were secretory and absorptive lineage markers known to be modulated during CRC progression (REG4, TFF1, FABP1) (Sasaki et al, 2016; Saha et al, 2022; Lawrie et al, 2004), indicating that this module represents features of differentiation processes of the normal and transformed colon epithelium. TMO 7 exhibited high pathway activity for Hypoxia, JAK-STAT, and TP53 signaling, and was enriched for Hallmark Hypoxia and TP53 pathway signatures (Fig. 3D,E). This suggested that the transcriptional program captured by this module is associated with hypoxic and other stress responses. TMO 8 was enriched for metabolism-related gene sets and Goblet cell signatures (Fig. 3E). It was associated with genes encoding extracellular matrix components COL5A2, LOX, and PRS35. Its association with oxidative stress markers (GSTA1, GSTA2, HMOX1) (Mazari et al, 2023) suggested a role in metabolic reprogramming. TMO 9 exhibited high TGFβ signaling activity, a pathway known to promote stemness and tumor progression in advanced CRC stages (Fig. 3D) (Chruścik et al, 2018). It was enriched in gene sets associated with normal colon crypt base cell types and signaling (Fig. 3E), with top genes including MYC, WNT16, HOXB3, HOXB9, and SOX4, suggesting that this module captures processes associated with stemness. Finally, TMO 10 was enriched for signatures related to the epithelial-to-mesenchymal transition (EMT) (Fig. 3E). We noted the presence of Wnt regulators (AXIN2, NOTUM) (Lustig et al, 2002; Kakugawa et al, 2015) among its top genes, alongside EMT, invasion and cancer progression markers such as KRT17, ZEB1, MYH7B, SNAI2, PROX1 (Zhang et al, 2015; Moorman et al, 2023). Taken together, this rigorous characterization led to the module definitions presented in Fig. 3H.

Oncogenes unlock specific phenotypic modules

After disentangling the transcriptional space into ten modules, we assessed the contribution of oncogenic variants to these transcriptional programs. For this, we inspected a UMAP representation of the latent space, revealing regionality of the oncogenes (Fig. 4A) and the TMOs (Fig. 4B). In particular, we noted that cells expressing the CCND1^T286I^ mutation or TP53 wild-type separated from cells with other oncogenes (Fig. 4A) and located to UMAP regions with the highest module weights observed for the cell cycle TMOs 4-5, and the stress response TMO 7, respectively.Figure 4. Various oncogenes unlock specific phenotypic modules.(A) Distribution of cells (left) and averaged position per oncogene (right) in the UMAP of the latent space (see Fig. 3F), colored by oncogene. (B) Distribution of cells (left) and averaged position (right), colored by the highest-ranking TMO. (C) Heatmap of average TMO activities in oncogene-overexpressing cells, ordered by line and transgene. Transgenes are ordered by pathway association. Color code for pathways as in Fig. 2E.

Oncogenic variants associated with the MAPK pathway formed a cluster occupying the same UMAP regions as TMO 1-3, which correspond to inflammatory response, plasticity/drug resistance, and replicative stress. This clustering was consistent with their elevated EGFR/MAPK activity, as indicated by the PROGENy analysis (Figs. 4A,B and 3C). These three modules also exhibited high weights across most oncogenic variants within the MAPK pathway, including BRAF, KRAS, NRAS, and MAP2K1, but not RAF1, with different activity strength across cell lines (Fig. 4C). Furthermore, TMO 7 association of transgenic TP53 wild-type was restricted to TP53-mutant cell lines, namely Caco-2, HT-29, and SW480 (Fig. 4C). TMO 8, related to metabolic reprogramming, showed high module weights in AKT1^E17K^ -expressing cells across all cell lines, suggesting a potential involvement of AKT-mTOR signaling in the metabolic processes captured by this module (Fig. 4C). Notably, high module weights in TMO 10, related to EMT, were observed exclusively in Caco-2 cells expressing CTNNB1^S45F^ and CTNNB1^T41A^ variants (Fig. 4C), further supporting a potential link between oncogenic Wnt signaling and EMT-associated phenotypic plasticity in these cells.

For the remaining modules TMO 6 and TMO 9, related to normal colon and stem cell states, respectively, no clear relationship with specific oncogenic variants was observed in the CRC cells.

Oncogenic MAPK variants induce heterogeneous cancer programs

Oncogenic mutations in KRAS, NRAS, BRAF, or MAP2K1 activate MAPK, a key cancer driver pathway in CRC. As our module analysis suggested that MAPK interacts with and contributes to multiple cancer phenotypes, we wanted to further examine and validate these findings. When we grouped oncogenes by pathway, we could confirm high average weights for MAPK oncogenes in TMO 1-3 (Fig. 5A), as also seen in an orthogonal dataset of Caco-2 cells with inducible MAPK oncogenes (Appendix Fig. 3A).Figure 5. Oncogenic MAPK variants drive heterogeneous phenotypic programs.(A) Heatmap of average TMO 1-3 scores induced by oncogeniv variants, grouped by pathway association. (B) Heatmap of TMO 1-3 module activities induced by MAPK pathway oncogenes or the tdTomato control, summarized by line and oncogenic variant. (C) Average TMO 1 (Inflammatory Response) activity versus YAP signaling scores using the Gregorieff EGFR-YAP (top) or Hallmark TNFa via NFkB (bottom) signature scores per cell line and oncogene. For color code, see Fig. 4. (D) Protein levels of YAP1 (top) and phospho-protein levels of NFkB phosphorylated at S536 (bottom) in Caco-2 cells expressing different MAPK oncogenes as measured by mass cytometry. Significant differences to the control were determined using a Wilcoxon rank-sum test and P adjustment using the Benjamini–Hochberg method. N = 95,633 cells. (E) TMO 2 (Plasticity/Drug Resistance; left) and TMO 3 (Replicative Stress; right) weights by cell cycle phase. N = 48,069 cells. (F) Fraction of HT-29 cells classified as G1 expressing MAPK oncogenes or tdTomato control as indicated. (G) Phospho-protein levels of CHK1 phosphorylated at S345 and CHK2 phosphorylated at T68 in Caco-2 cells expressing different MAPK pathway oncogenes as measured by mass cytometry. N = 95,633 cells. (H) Average colony size in HT-29 cells expressing MAPK pathway-associated oncogenes or tdTomato control as indicated, 7 days after seeding single cells. N = 626 colonies. (I) Representative colonies of HT-29 cells transduced with oncogenic variants, as indicated, on day 7 post-seeding. Scale bar 200 µm. For further colonies and transgene-related fluorescence overlays, see Appendix Fig. 3B. (D, E, G, H) Box-and-whiskers plots show median, Q1 and Q3 values and whiskers extending to 1.5× interquartile range. Significant differences to the control were determined using a Wilcoxon rank-sum test and P adjustment using the Benjamini–Hochberg method. Significance levels are: ****P < 0.0001, ***P < 0.001, **P < 0.01, *P < 0.05, ns = P > 0.05. P values for panel (D) for YAP1 expression (top): BRAF-V600E: 4.87e-13, BRAF-WT: 0, KRAS-G12D: 0, KRAS-Q61L: 0, KRAS-WT: 0; for pNFkB expression (bottom): BRAF-V600E: 4.16e-4, BRAF-WT: 6.76e-43, KRAS-G12D: 4.99e-30, KRAS-Q61L: 1.19e-22, KRAS-WT: 2.30e-22. P values for (G) for pChk1 expression (left): BRAF-V600E: 1.96e-6, BRAF-WT: 0, KRAS-G12D: 0, KRAS-Q61L: 8.96e-304, KRAS-WT: 2.79e-53; for pChk2 expression (right) are: BRAF-V600E: 6.47e-3, BRAF-WT: 2.11e-27, KRAS-G12D: 6.21e-270, KRAS-Q61L: 5.03e-172, KRAS-WT: 2.31e-5. P values for panel (H) are: KRAS-G12D: 0.538, KRAS-G13D: 0.020, KRAS-Q61L: 1.22e-4, MAP2K1-Q56P: 0.020. Source data are available online for this figure.

The inflammatory response TMO 1 was preferentially induced by BRAF^V600E^, KRAS^Q61L^ and NRAS^Q61K^ in RAS/RAF-wild-type Caco-2 cells, whereas almost all MAPK driver mutants induced this module in the BRAF-mutant cell lines HT-29 and RKO (Fig. 5B). Our GSEA analysis (Fig. 3E, above) suggested involvement of YAP-Hippo and TNFα/NFκB signaling in TMO 1, and in agreement with this analysis, the MAPK driver mutants strongly induced both pathways in MAPK-wild-type and BRAF-mutant cell lines (Fig. 5C). This was also validated by mass cytometry for total YAP1 and phospho-NFκB protein abundance in Caco-2 cells overexpressing MAPK oncogenes (Fig. 5D). These data reflect downstream pathway activation in the inflammatory response, orchestrated by MAPK mutants.

Based on cell cycle classifications, TMO 2, related to cell plasticity and drug resistance, was associated with enrichment of G1 cell cycle phase, whereas TMO 3, related to replicative stress, was associated with the S cell cycle phase (Fig. 5E). To directly assess the effects on the cell cycle, we analyzed the cell cycle distributions by flow cytometry upon MAPK oncogene overexpression in the HT-29 cell line and found a consistent increase in G1 phase compared to control cells (Fig. 5F). Mass cytometry revealed in addition phosphorylation of S phase replicative stress effector proteins CHK1 and CHK2 in Caco-2 cells (Fig. 5G). Both G1 phase enrichment and replicative stress would suggest a decrease in proliferation. Indeed, HT-29 cells expressing MAPK oncogenes exhibit a significantly reduced colony size (Fig. 5H) and an altered morphology compared to control cells (Fig. 5I; Appendix Fig. 3B). Taken together, the data confirm a model of collaborating MAPK-induced phenotypic modules regulating inflammatory response, cell plasticity, and replicative stress.

Colorectal cancer patients can be stratified using TMOs

After having established the functional relevance of the TMOs, we aimed to assess their clinical relevance. For this, we reduced the modules to gene expression signatures suitable to study tumor cell-intrinsic properties in clinical cohorts (Fig. 6A). The modules consist of gene lists ranked by weight from which we removed genes with predominant immune or stromal expression using single-cell transcriptome data from CRC patients (Uhlitz et al, 2021). Next, we determined a cut-off for each module by computing the Hazard Ratio (HR) of progression-free survival (PFS) for each module in our discovery cohort (TCGA-COADREAD with 592 patients) (Hoadley et al, 2018; Ellrott et al, 2018; Taylor et al, 2018; Liu et al, 2018). This resulted in epithelium-centered module signatures with sizes of 60 to 160 genes (Dataset EV2). We confirmed that the filtering of the signatures did not deteriorate their performance when assessed on our original screening data (Appendix Fig. 4).Figure 6. Colorectal cancer patients can be stratified using TMO signatures.(A) Scheme of the strategy to determine the optimal cut-off for module signatures. (B) Differences of module scores across clinical parameters using the TCGA cohort (n = 592 patients). Statistical significance was determined using a Wilcoxon rank-sum test for pairwise comparisons and a Kruskal–Wallis test for multiple conditions. (C) Average module score in patients with BRAF, KRAS or TP53 mutations as indicated for the TCGA cohort (top) and the Marisa et al, cohort (bottom). Significance was determined using a Wilcoxon rank-sum test against patients with annotated wild-type alleles of the respective genes. (D Workflow of clinical cohort analysis. (E) Proportional hazards regression model analysis of relapse-free survival in the Marisa et al, cohort testing for associations with TMO scores. Points indicate maximum likelihood estimates for hazard ratio, bars indicate the 95% confidence interval. Color indicates significance which was determined using a Wald test followed by multiple testing correction using the Benjamini–Hochberg method. *N *= 557 patients. (F) Meta analysis using a random-effects model of proportional hazards regression for relapse-free survival in three additional validation cohorts with TMO scores. Data are displayed as in (E). N = 510 patients. (G) Overlap of patient groups with high and low module scores (cut-off determined by survminer). The relative overlap per group was determined by dividing the observed overlap by the expected overlap of patients in the high or low group of two modules, and log2-transformed. (H) Kaplan–Meier plots of patients stratified by four significant TMOs in the Marisa et al, cohort (cut-off determined by survminer). Significance was calculated using a log-rank test. Significance levels are: ****P < 0.0001, ***P < 0.001, **P < 0.01, *P < 0.05, ns = P > 0.05. P values for (E) are: TMO 1: 1.90e-5, TMO 2: 1.87e-5, TMO 3: 8.05e-6, TMO 4: 8.05e-6, TMO 5: 2.14e-5, TMO 6: 0.995, TMO 7: 0.995, TMO 8: 0.145, TMO 9: 0.658, TMO 10: 6.75e-5; P values for (F) are: TMO 1: 3.71e-5, TMO 2: 1.81e-6, TMO 3: 0.067, TMO 4: 0.056, TMO 5: 0.094, TMO 6: 0.057, TMO 7: 0.25, TMO 8: 0.63, TMO 9: 0.15, TMO 10: 5.99e-3. P values for (H) are: TMO 1: 1.57e-06, TMO 2: 7.61e-07, TMO 3: 6.37e-08, TMO 10: 2.35e-06.

Correlation of TMO signature scores with clinical parameters showed only a weak association with stage or clinical subtype, for example decreasing Normal Colon or Replicative Stress scores with increasing T stage (Fig. 6B; Appendix Fig. 5). A notable exception was TMO 2 (Plasticity/Drug Resistance), which was elevated in mucinous adenocarcinomas (Fig. 6B; Appendix Fig. 5). We found strong associations of most of the modules with MSI status, as well as with previously published transcriptome-based classification systems such as CMS and iCMS (Guinney et al, 2015; Joanito et al, 2022) (Fig. 6B; Appendix Fig. 5). For instance, the inflammatory response-related TMO 1 (Inflammatory Response) was highest in CMS1 MSI Immune and CMS4 Mesenchymal CRCs, TMO 3 (Replicative Stress) was elevated in CMS1 and decreased in CMS4, and TMO 10 (EMT), was high in CMS4, in line with their respective functional roles (Appendix Fig. 5).

Because our modules were derived from oncogene activities on the transcriptome, we asked whether TMO scores were associated with mutational profiles in patients. We found that BRAF-mutant CRCs had high scores for the MAPK-related TMOs 1-3, but also for the Cell Cycle Progression modules TMO 4 and 5 (Fig. 6C). In contrast, BRAF-mutant CRCs showed low scores for TMO 9 (Stemness) and TMO 10 (EMT). KRAS-mutant cancers showed positive associations with TMO 1 and TMO 2, but also to a lower extent with TMO 9 and TMO 10. In addition, TP53 mutations were negatively associated with TMO 2, TMO 6, and TMO 7.

We next asked whether the modules have prognostic value. Since our signature cutoffs were trained on the PFS of the discovery cohort, we used an independent validation cohort of 566 patients (Marisa et al, 2013) for the survival analysis to avoid bias (Fig. 6D). Clinical parameters of this cohort showed similar associations as with the discovery cohort (Appendix Fig. 6A). Six out of the ten modules showed significant prognostic value using a proportional hazards regression model (Fig. 6E): Higher scores for TMO 3 (Replicative Stress) and TMOs 4-5 (Cell Cycle Progression) were associated with lower hazard (HR: 0.35, 0.42, 0.42, respectively), whereas higher scores for TMO 1, 2 and 10 (Inflammatory Response, Plasticity/Resistance and EMT, respectively) were associated with higher hazard (HR: 3.5, 3.4, 4.8, respectively). These associations remained significant after adjusting for clinical parameters such as MSI status or AJCC stage (Fig. EV3; Appendix Fig. 6B). A meta-analysis of three additional smaller validation cohorts (Sadanandam et al, 2013; De Sousa et al, 2011; Laibe et al, 2012) using a random-effects model confirmed these results (Fig. 6F; Appendix Fig. 6C–E).

We stratified patients into high and low groups for the module activities using survminer (Kassambara et al, 2017). Patient groups stratified by TMO 1, TMO 2, and TMO 10 (Inflammatory Response, Plasticity/Drug Resistance, and EMT, respectively) were largely independent (Fig. 6G). In contrast, patient groups scoring low for TMO 3 (Replicative Stress) and the cell cycle progression modules TMP 4 and 5 overlapped. We observed strong survival differences for patients stratified by TMOs 1-3 (Inflammatory Response, Plasticity/Drug Resistance, and Replicative Stress) and TMO 10 (EMT) (Fig. 6H), indicating their prognostic value.

TMOs represent clinically relevant information beyond previous classifications

Previous transcriptome-based classification (Guinney et al, 2015; Marisa et al, 2013) had moderate prognostic value (Appendix Fig. 7A,B). To test whether the TMOs merely replicate these subtypes or provide additional clinically relevant phenotypic characteristics, we performed survival analysis adjusted for previous subtypes. This analysis showed that even when considering CMS subtypes (Guinney et al, 2015) (Fig. 7A) or molecular subtypes (Marisa et al, 2013) (Appendix Fig. 7C), the prognostic value remained significant, indicating that the modules encode clinically relevant phenotypes not covered by previous classifiers. Within CMS subgroups, stratification by TMOs provided additional information on survival (Fig. 7B; Appendix Fig. 7D,E). CMS2 and CMS3, representing different cancer cell-intrinsic CRC subtypes (Joanito et al, 2022), were disentangled by TMO 1 and TMO 2 (Inflammatory Response and Plasticity/Drug Resistance, respectively). TMO 1 was able to stratify CMS3 patients with a HR of 5.7 (Fig. 7C), whereas TMO 2 was able to stratify CMS2 patients with a HR of 3.7 (Appendix Fig. 7E). Furthermore, TMO 3 and 10 (Replicative Stress and EMT, respectively) allowed to separate the high-risk CMS4 subtype with HRs of 0.5 and 2.0, respectively (Fig. 7B,D).Figure 7TMOs represent clinically relevant information beyond CMS classification.(A) Proportional hazards regression model analysis of relapse-free survival, adjusted for CMS classification, in the Marisa et al, cohort testing for associations with TMO scores. Points indicate maximum likelihood estimates for hazard ratio. Bars indicate the 95% confidence interval. Color indicates significance. Multiple testing correction was performed using the Benjamini–Hochberg method. (B, C) Differential survival of patients classified by CMS and in addition stratified by TMO scores. (B) Statistical tests for significance of differential survival classified by CMS subgroups and in addition stratified for low versus high scores of TMO 1, TMO 2 TMO 3 or TMO 10. (C) Kaplan–Meier plot of patients classified as CMS3 and in addition stratified by high or low scores of TMO 1 (cut-off determined by survminer). (D) Kaplan–Meier plots of patients classified as high-risk CMS4 and in addition stratified by high or low scores of TMO 3 (left) or TMO 10 (right) (cut-off determined by survminer). (E) Kaplan–Meier plot of the Replicative Stress module (cut-off determined by survminer) stratified by adjuvant chemotherapy regime in AJCC stage 3 patients. Significance was calculated using a log-rank test. Significance levels are: ****P < 0.0001, ****P *< 0.001, **P < 0.01, *P < 0.05, ns = P >0.05. P values for (A) are: TMO 1: 7.61e-4, TMO 2: 7.61e-4, TMO 3: 2.60e-3, TMO 4: 2.60e-3, TMO 5: 7.66e-3, TMO 6: 0.871, TMO 7: 0.896, TMO 8: 0.253, TMO 9: 0.792, TMO 10: 0.015. P values for (C) are: overall: 2.78e-05, CMS3: TMO 1-High vs. TMO 1-Low: 2.43e-04. P values for (D) are: TMO 3 + CMS4 (left) overall: 1.30e-07; CMS4: TMO 3-High vs. TMO 3-Low: 5.11e-3; TMO 10 + CMS4 (right) overall: 5.83e-08; CMS4: TMO 10-High vs. TMO 10-Low: 0.011. P values for (E) are: overall: 1.79e-05; TMO 3-Low: w/Chemo vs. w/o Chemo: 0.045; w/Chemo: TMO 3-High vs. TMO 3-Low: 1.93e-3; w/o Chemo: TMO 3-High vs. TMO 3-Low: 1.47e-4.

We also compared our TMOs to signatures derived from experimental systems or of pre-neoplastic lesions. These comprise a humanized Lgr5-centered intestinal stem cell signature related to CRC relapse (Merlos-Suárez et al, 2011), a signature of CRC metaplasia (Joanito et al, 2022), a signature of epithelial high-risk genes related to CRC recurrence (EpiHR) (Cañellas-Socias et al, 2022), and a signature of chemoresistant reserve-like-stem cells expressing MEX3A (Barriga et al, 2017) (Dataset EV1).

Of note, the EpiHR and MEX3A signatures show a low prognostic value (Fig. EV4A), and particularly the EpiHR signature overlaps with the TMOs 1, 2, and 10 (Inflammatory response, Plasticity/Drug Resistance, and EMT, respectively) that are likewise associated with bad prognosis (Fig. EV4B). Adjusting for these signatures retains significant prognostic values of our TMOs, except for EpiHR, where only TMO3 (Replicative Stress) remains significant. This suggests that the TMOs 1, 2, and 10 cover different aspects of the EpiHR signature (new Fig. EV4C).

We also found a predictive association of TMO 3 (Replicative Stress) in patients classified as TNM stage 3 who frequently profit from adjuvant chemotherapy (Fig. 7E). Patients with a low TMO 3 score showed worse survival; however, only this subgroup of patients benefited from chemotherapy.

Discussion

By combining pooled perturbation screens with single-cell readouts and contemporary machine learning methods, we systematically mapped out how oncogenes shift cell states in a panel of CRC cell lines. We defined ten modules representing transcriptional diversity induced by oncogenes that represent cancer-relevant phenotypes. Oncogene activities and module engagement differed between CRC cell lines, suggesting that oncogene activity depends on context such as other present mutations. Several modules, particularly those termed TMO 1 Inflammatory Response, TMO 2 Plasticity/Drug Resistance, TMO 3 Replicative Stress, and TMO 10 EMT, had prognostic value across multiple cohorts of CRC patients. In addition, TMO 3 predicted higher patient survival independent of chemotherapy admission, but only patients with low TMO 3 scores benefited from chemotherapy.

Bulk and single-cell transcriptome analysis has served as a foundation for CRC stratification for over a decade. The cluster-based methodology underlying CMS and iCMS classification (Guinney et al, 2015; Joanito et al, 2022) has since been expanded through the integration of gene ontology data (Malla et al, 2024) and the application of machine learning techniques (Yazdani et al, 2025; Zhou et al, 2023b). Our approach differs from these approaches as our modules are based on cancer cell-intrinsic functions induced by oncogenes. The TMOs offer additional stratification not covered by previous classifications, as they remain prognostic even when adjusting for CMS subtypes. For instance, we found a strong negative prognostic value of TMO 1 particularly in CMS3 patients (Fig. 7C). TMO 1 was linked to high activity of MAPK, NFκB, and Wnt/β-Catenin (Fig. 3D), a molecular profile matching well-defined mouse models of inflammatory CRC progression (Schwitalla et al, 2013). Thus, TMO 1 defines a subgroup of CMS3 CRCs that could potentially benefit from therapeutic anti-inflammatory intervention. As another example, low activity of TMO 3 (Replicative stress) and high activity of TMO10 (EMT) could further specify groups of high-risk CRC patients of the CMS4 subtype, which is associated with the lowest relapse-free survival rates.

The TMOs can be functionally interpreted, as they show clear associations to specific cancer pathways and functions rather than to oncogenic mutations present in a patient, suggesting that multiple combinations of genetic or epigenetic driver events could unlock similar transcriptomic responses. The modules correlated with established signatures reflecting CRC biology (Joanito et al, 2022) and specific CRC cell states (Álvarez-Varela et al, 2022; Cañellas-Socias et al, 2022), reflecting their cooperative roles in establishing the cancer phenotype. In addition, we noticed that, in cell lines, particular oncogenes unlock different modules contingent on the genomic background, as exemplified by MAPK-related oncogenes activating different combinations of the TMOs 1–3 in cell lines (Fig. 5B). Thus, our approach allows cataloging genotype-phenotype relations relevant for cancer. Accordingly, our experimental strategy represents a shift from static subtyping towards stratification based on dynamic, experimentally defined dependencies.

Large-scale screens have advanced our understanding of oncogenesis and therapy response. However, only a few studies have systematically explored the cancer driver landscape using gain-of-function approaches (Martz et al, 2014; Ng et al, 2018; Koivu et al, 2024). Many more studies focused on the effects of loss-of-function mutations, by genome-wide (Chen et al, 2015; Michels et al, 2020; Ringel et al, 2020; Yu et al, 2022a; Wan et al, 2021) or gene set-specific (Spisak et al, 2024; Picco et al, 2021) CRISPR screening. Consequently, transgenic expression of oncogenes remains underutilized as a powerful approach for functional dissection of cancer functions.

It is of note that our screening approach is limited to tumor cell-intrinsic oncogenic effects in an already established cancer genome background and to a limited set of cell lines growing in an in vitro 2D microenvironment. Thus, the context-specificity of oncogenes is not fully captured and would require a larger set of cell line models. Moreover, the derived signatures only represent tumor cell intrinsic programs, and other approaches are required to classify the microenvironment for prediction or prognosis. We would also expect that oncogenes can engage with further transcriptional modules in normal or pre-cancer cell genomes, as well as in normal or cancer cells growing in a complex three-dimensional microenvironment, e.g., in organoid or canceroid cultures, xenografts, or genetic mouse models of cancer. As oncogenic mutation frequencies strongly differ between different cancer entities (Campbell et al, 2020), it will be interesting to see whether modules and relationships between oncogenes and modules are conserved or entity-specific.

Methods

Reagent and tools tableReagent/resourceReference or sourceIdentifier or catalog number Experimental models HEK293T (H. sapiens)ECACC04072121Caco-2 (H. sapiens)CellosaurusCVCL_0025HT-29 (H. sapiens)CellosaurusCVCL_0320RKO (H. sapiens)CellosaurusCVCL_0504SW480 (H. sapiens)CellosaurusCVCL_0546HCT 116 (H. sapiens)CellosaurusCVCL_0291 Recombinant DNA pRRL-RIEP (encodes mCAT1 receptor)Institute of Pathology, CharitéProf. Dr. Christine Sers, Charité, Germanyp_sc-eVIP-Δpuro-EGFP-tdTomatoThis study/Addgene#249437p_sc-eVIP-Δpuro-EGFP-ACKR3This study/Addgene#249438p_sc-eVIP-Δpuro-EGFP-ACKR3-R161CThis study/Addgene#249439p_sc-eVIP-Δpuro-EGFP-AKT1This study/Addgene#249440p_sc-eVIP-Δpuro-EGFP-AKT1-E17KThis study/Addgene#249441p_sc-eVIP-Δpuro-EGFP-ARThis study/Addgene#249442p_sc-eVIP-Δpuro-EGFP-AR-Q58LThis study/Addgene#249443p_sc-eVIP-ΔPuro-EGFP-BCL6This study/Addgene#249444p_sc-eVIP-ΔPuro-EGFP-BCL6-R594QThis study/Addgene#249445p_sc-eVIP-ΔPuro-EGFP-BRAFThis study/Addgene#249446p_sc-eVIP-ΔPuro-EGFP-BRAF-N581SThis study/Addgene#249447p_sc-eVIP-ΔPuro-EGFP-BRAF-V600EThis study/Addgene#249448p_sc-eVIP-ΔPuro-EGFP-CCND1This study/Addgene#249449p_sc-eVIP-ΔPuro-EGFP-CCND1-T286IThis study/Addgene#249450p_sc-eVIP-ΔPuro-EGFP-CTNNB1This study/Addgene#249451p_sc-eVIP-ΔPuro-EGFP-CTNNB1-S45FThis study/Addgene#249452p_sc-eVIP-ΔPuro-EGFP-CTNNB1-T41AThis study/Addgene#249453p_sc-eVIP-ΔPuro-EGFP-FESThis study/Addgene#249454p_sc-eVIP-ΔPuro-EGFP-FES-R25HThis study/Addgene#249455p_sc-eVIP-ΔPuro-EGFP-GNASThis study/Addgene#249456p_sc-eVIP-ΔPuro-EGFP-GNAS-R201HThis study/Addgene#249457p_sc-eVIP-ΔPuro-EGFP-KIT-E257DThis study/Addgene#249458p_sc-eVIP-ΔPuro-EGFP-KRASThis study/Addgene#249459p_sc-eVIP-ΔPuro-EGFP-KRAS-A146TThis study/Addgene#249460p_sc-eVIP-ΔPuro-EGFP-KRAS-G12DThis study/Addgene#249461p_sc-eVIP-ΔPuro-EGFP-KRAS-G13DThis study/Addgene#249462p_sc-eVIP-ΔPuro-EGFP-KRAS-Q61LThis study/Addgene#249463p_sc-eVIP-ΔPuro-EGFP-MAP2K1This study/Addgene#249464p_sc-eVIP-ΔPuro-EGFP-MAP2K1-G128DThis study/Addgene#249465p_sc-eVIP-ΔPuro-EGFP-MAP2K1-Q56PThis study/ Addgene#249466p_sc-eVIP-ΔPuro-EGFP-NFKB2This study/Addgene#249467p_sc-eVIP-ΔPuro-EGFP-NFKB2-R52QThis study/Addgene#249468p_sc-eVIP-ΔPuro-EGFP-NRASThis study/Addgene#249469p_sc-eVIP-ΔPuro-EGFP-NRAS-G12DThis study/Addgene#249470p_sc-eVIP-ΔPuro-EGFP-NRAS-Q61KThis study/Addgene#249471p_sc-eVIP-ΔPuro-EGFP-PAX3This study/Addgene#249472p_sc-eVIP-ΔPuro-EGFP-PAX3-T424MThis study/Addgene#249473p_sc-eVIP-ΔPuro-EGFP-PDGFRAThis study/Addgene#249474p_sc-eVIP-ΔPuro-EGFP-PDGFRA-R151HThis study/Addgene#249475p_sc-eVIP-ΔPuro-EGFP-PIK3CAThis study/Addgene#249476p_sc-eVIP-ΔPuro-EGFP-PIK3CA-E545KThis study/Addgene#249477p_sc-eVIP-ΔPuro-EGFP-PIK3CA-H1047RThis study/Addgene#249478p_sc-eVIP-ΔPuro-EGFP-RAF1This study/Addgene#249479p_sc-eVIP-ΔPuro-EGFP-RAF1-S257LThis study/Addgene#249480p_sc-eVIP-ΔPuro-EGFP-SMAD2This study/Addgene#249481p_sc-eVIP-ΔPuro-EGFP-SMAD2-R321QThis study/Addgene#249482p_sc-eVIP-ΔPuro-EGFP-SMAD4This study/Addgene#249483p_sc-eVIP-ΔPuro-EGFP-SMAD4-R361HThis study/Addgene#249484p_sc-eVIP-ΔPuro-EGFP-TP53This study/Addgene#249485p_sc-eVIP-ΔPuro-EGFP-TP53-R175HThis study/Addgene#249486p_sc-eVIPAddgene#168174psPAX2Addgene#12260pCAG-EcoAddgene#35617pRSV-RevAddgene#12253 Antibodies 5-Iodo-2’-deoxyuridine (IdU)Standard Biotools201127Cell-ID™ CisplatinStandard Biotools201064anti-BrdU antibodyBD Biosciences347580Alexa Fluor™ 405-conjugated F(ab)-Goat anti-Mouse IgG1 Fc antibodyThermo FisherA-31553 Oligonucleotides and other sequence-based reagents Targeted PrimerThis studySee methodsPartial readOne PrimerThis studySee methods Chemicals, enzymes, and other reagents Gateway LR Clonase II Enzyme MixThermo Fisher11791020ROTI®Cell DMEM Low GlucoseCarl Roth9027.1Stable GlutaminePAN-BiotechP04-82100FBS Xtra, Fetal Bovine Serum (FBS), Collected in South AmericaCapricorn-scientificFBS-16APenicillin-Streptomycin, 10,000 U/ml Penicillin, 10 mg/ml StreptomycinPAN-BiotechP06-07050Polyethylenimine, Linear, MW 25000, Transfection Grade (PEI 25 K™)Polysciences23966-100PolybreneSanta Cruz Biotechnologysc-134220DPBS, w/o: Ca and MgPAN-BiotechP04-36500Ecotropic Receptor BoosterTakara Bio631471Subcloning Efficiency™ DH5α™ Competent CellsThermo Scientific18265017One Shot™ ccdB Survival™ 2 T1R Competent CellsThermo ScientificA10460KAPA HiFi HotStart ReadyMixRoche7958927001SPRIselect beadsBeckman CoulterB23318Chromium Next GEM Single Cell 3ʹ Kit v3.1 (dual-Index)10X GenomicsPN-1000268Chromium Next GEM Chip G Single Cell Kit10X GenomicsPN-1000120Dual-Index Kit TT Set A10X GenomicsPN-1000215phosphatase/protease inhibitorsRoche4906837001Cell Staining Buffer MaxparStandard Biotools2010681 mL Pierce™ Methanol-free Formaldehyde Ampules, 16% (w/v)Thermo Fisher Scientific28906AccutaseGibco25-058-CIBenzonaseThermo Fisher Scientific88702Cell-ID 20-Plex Pd Barcoding KitStandard Biotools201060Propidium iodide Ready Flow ReagentThermo Fisher ScientificR37169Triton X-100Sigma-AldrichT8787Bromodeoxyuridine (BrdU)BD Biosciences550891CompactPrep Plasmid Maxi Kit (25)Qiagen12863QIAprep Spin Miniprep Kit (50)Qiagen27104 Software R/ R StudioPythonCellranger (v6.1.1) https://www.10xgenomics.com/support/software/cell-ranger/downloads FlowJo (v 10.9.0) https://www.flowjo.com Affinity Designer (1.10.8) Other Chromium-X controller10X GenomicsNova Seq X Plus 25BIllumina Inc.TapeStationAgilentIncucyteSatoriusFACSAria sorterBD Biosciences

Oncogene selection

Mutational data for oncogenes were obtained from cBioPortal, drawing from the following studies: Genentech (Seshagiri et al, 2012), TCGA PanCancer Atlas (Hoadley et al, 2018; Ellrott et al, 2018; Taylor et al, 2018; Liu et al, 2018), MSK (Brannon et al, 2014), CaseCCC (Guda et al, 2015), DFCI (Giannakis et al, 2016), and CPTAC-2 Prospective (Vasaikar et al, 2019). Patient data were merged across studies using unique patient IDs. Genes were filtered to include only those with a coding sequence shorter than 4000 base pairs. Missense mutations in tumor suppressor genes (Zhao et al, 2016) were excluded unless known to exhibit dominant effects. The resulting gene list was prioritized based on the mutational frequency across all patients. To capture regional mutational enrichment, a sliding window approach was applied across each gene’s coding sequence to identify hotspot regions (defined as >0.1% mutation frequency). In such cases, the most frequent mutation within the hotspot or a well-characterized gain-of-function mutation was selected.

Plasmid design and handling

The transfer plasmid p-sc-eVIP-EGFP was adapted from p-sc-eVIP (Addgene ID: 168174) (Ursu et al, 2022), by incorporating L/R gateway sites, EGFP as a selection marker, and unique 10 bp barcode sequences. Barcodes were positioned between the polyadenylation signal sequence and the L/R sites at the 3′ end of the coding sequence to enable identification of transgenes in single-cell RNA-sequencing (scRNA-seq). Gateway cloning was performed using LR Clonase enzyme mix (Thermo Scientific) to shuttle the (proto-)oncogene of interest into the destination vector, following the manufacturer’s standard protocol. All transfer plasmids used in this study are available from Addgene (Addgene IDs: 249437-249486).

All plasmids were transformed and propagated in E. coli DH5α competent cells (Thermo Scientific) using appropriate selection. The p-sc-eVIP backbone contains the ccdB gene, necessitating the use of ccdB Survival 2 T1R Competent Cells (Thermo Fisher).

Cell culture

Cell lines were maintained for no more than ten passages in low-glucose DMEM (Roth) supplemented with 10 % fetal bovine serum (FBS) (PanBiotech), 1 % stable L-glutamine (PanBiotech), and 1 % penicillin/streptomycin (PanBiotech) at 37 °C and 5 % CO_2_. For cell lines expressing transgenic m-CAT1 receptor (HT-29-etP, RKO-etP, Caco-2-etP, SW480-etP), the medium was supplemented with puromycin to maintain selection pressure.

Pseudotyped ecotropic lentiviruses were generated in HEK293T cells using the packaging plasmids psPAX2 (Addgene ID: 12260) and pRSV-Rev (Addgene ID: 12253), along with the envelope plasmid pCAG-Eco (Addgene ID: 35617), which encodes the ecotropic envelope protein. The destination vector p-sc-eVIP-EGFP was included at 2.5 times the molar ratio of the other plasmids. Cells were transfected at 80 % confluency using PEI 25K (Polyscience), and viral supernatants were harvested at 48 and 72 h post transfection. The collected virus was filtered through a 0.2-µm filter and stored at –80 °C for long-term use.

CRC cells were infected at 50 % confluency by combining viral supernatant with 8 µg/mL Polybrene (Santa Cruz Biotechnology) at a maximum multiplicity of infection (MOI) of 0.2. HCT116 cells were pre-treated with Ecotropic Receptor Booster (Takara Bio) prior to viral exposure, according to the manufacturer´s instructions. Cells were exposed to the virus for 20 h before medium change. All cells processed for the single-cell datasets in this study were harvested 4 days post-infection (dpi). Cell lines were regularly checked for mycoplasma infection and authenticated using STR profiling and SNP calling in the scRNA-seq libraries.

Single-cell RNA library preparation

Gene expression (GEX) libraries were created from 10,000 to 30,000 cells transfected with the lentiviral oncogene library and sorted for green fluorescence, using a FACSAria sorter (BD Biosciences) and the Chromium-X controller (10X Genomics), and Chromium Next GEM Single Cell 3ʹ Reagent Kits v3.1 (10X Genomics), following standard dual-index protocols. Libraries were sequenced on the Nova Seq X Plus 25B with 150 PE (Illumina, Inc.).

Targeted library preparation

Targeted libraries for barcode identification of transgenes were generated using amplified cDNA that served as input for the GEX library. PCR amplification was performed using a targeted primer (5’-GTGGTGTTAATTAACCTAGAGGGC-3') containing a specific handle (5′-GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT-3′) for the second PCR and a partial Read One primer (IDT; 5′-CTACACGACGCTCTTCCGATCT-3′, 10X Genomics). This primer pair was designed to amplify the 10 bp barcode region introduced by the viruses. Each 100 µL PCR reaction contained 100 ng cDNA, 2.5 µL of 100 µM targeted primer, 4 µL of 10 µM Read One primer, and 50 µL KAPA HiFi HotStart ReadyMix (Roche, KK2601), with PCR-grade water added to volume. The thermal cycling program was: 1× (95 °C, 3 min), 5× (98 °C, 20 s, 67 °C, 60 s, 72 °C, 60 s), 1× (72 °C, 5 min).

PCR products underwent a two-sided size selection using SPRIselect beads (Beckman Coulter) at 0.6× and 1.2×, following the Chromium Next GEM Single Cell 3′ Reagent Kits v3.1 (10X Genomics) protocol. For the index PCR, 30 µL of the cleaned product was used, and all subsequent steps followed the standard 10X Genomics indexing and clean-up workflow. The index PCR consisted of 14 amplification cycles. Product quality and size (~360 bp) were verified using TapeStation (Agilent).

Mass cytometry

For mass cytometry, Caco-2 cells were seeded and infected as described above. On day 4 post-infection, cells were pulsed with 10 µM 5-Iodo-2'-deoxyuridine (IdU, Standard Biotools) for 25 min at 37 °C. Subsequently, 2 µM cisplatin (Standard Biotools) and phosphatase/protease inhibitors (at recommended working concentrations, Roche) were applied for 5 min at 37 °C. Cells were washed once with Cell Staining Buffer (Standard Biotools) and once with PBS, then fixed in 2 % methanol-free formaldehyde (Thermo Fisher Scientific) in PBS for 15 min at 37 °C. The fixation was quenched with 1 % BSA in PBS, followed by two washes in PBS. Cells were incubated with Accutase (Gibco) for 1 h at 37 °C to facilitate detachment, then scraped, transferred to 15-mL tubes, and washed twice with Cell Staining Buffer. Pellets were resuspended in 100 U/mL Benzonase (Thermo Fisher Scientific) in pre-warmed Cell Staining Buffer and incubated for 5 min at 37 °C. Cells were then washed once with 1 % BSA in PBS and once with Cell Staining Buffer, passed through a 30-µm filter, counted, and kept at 4 °C until barcoding. Barcoding was performed using the Cell-ID 20-Plex Pd Barcoding Kit (Standard Biotools) according to the manufacturer’s protocol. Subsequently, cells were pooled, counted, and prepared for staining.

Antibody staining, measurement, and analysis were performed as previously described (Sell et al, 2023). Details on the antibody panel are provided in Table EV1.

Cell cycle staining

For cell fixation and staining, the following buffers were prepared in advance: 70% ice-cold ethanol, washing buffer (PBS + 1% BSA, 0.2% Triton X-100), and blocking buffer (PBS + 5% BSA, 0.2% Triton X-100). HT29 cells were seeded at 1 × 10^6^ cells per 10-cm dish and allowed to adhere overnight. Cells were then pulsed with 30 µM BrdU (PB) for 20 min. After trypsinization and PBS wash, cells were fixed in 70% ice-cold ethanol and stored at −20 °C. Fixed cells (divided into technical replicates) were washed twice with washing buffer. DNA was denatured by incubating in 2 M HCl + 0.2% Triton X-100 for 30 min at room temperature (RT), followed by another wash. Cells were then blocked for 1 h in blocking buffer and incubated with mouse anti-BrdU antibody (0.5 µg/mL, BD Biosciences) for 1 h at RT. After two washes, secondary staining was performed using Alexa Fluor™ 405-conjugated F(ab)-Goat anti-Mouse IgG1 Fc antibody (2 µg/mL, Thermo Fisher) for 1 h at RT in the dark, followed by two additional washes. For DNA content staining, cells were resuspended in 0.5 mL propidium iodide solution (Thermo Fisher Scientific) in 1 mL 0.1% Triton X-100 in PBS and incubated for 15 min in the dark at RT. Samples were stored at 4 °C overnight in the dark and analyzed the next day by flow cytometry.

Colony formation assay

In total, 2–10 single cells per well were seeded into a 96-well plate (Falcon) in DMEM + 10% FCS and emerging colonies were imaged in the Incucyte (Sartorius) cell culture microscope for 7 days. Average colony size was calculated per well, by dividing the confluency area calculated by the number of colonies on day 7, using Incucyte software.

Single-cell RNA-sequencing data pre-processing

Raw data processing

Raw sequencing reads were processed using cellranger software (v6.1.1) with a reference genome provided by 10X Genomics (refdata-gex-GRCh38-2020-A). Intronic reads were excluded; otherwise, standard parameters were used. Transgenes were assigned to cells from the separate dial-out library using the feature reference mechanism of cellranger with pattern 5PGTGGTGTTAATTAACCTAGAGGGCCCGTTT (BC).

Quality control and filtering

Cells were debarcoded based on the unique nucleotide barcodes integrated into the plasmids. After applying centered log-ratio (CLR) normalization using Seurat’s (Satija et al, 2015; Hao et al, 2021; Stuart et al, 2019) NormalizeData function (v5.0.3) to the barcode counts, manual debarcoding was performed using custom-defined thresholds to assign each cell to its respective oncogene variant. Cells with unassigned barcodes were discarded. Quality control was implemented by filtering out cells with less than 2000 RNA features and more than 15% mitochondrial RNA content. This resulted in a total of 105,349 cells. For most downstream analyses, the data were downsampled to a maximum number of 150 cells per oncogene, line, and replicate, resulting in 48,069 cells.

scRNA-sequencing data analysis

Following quality control, the data were normalized and scaled using Seurat’s NormalizeData and ScaleData functions, respectively. Principal component analysis (PCA) was performed with the RunPCA function, focusing on the most highly variable genes (HVG). The top ten principal components were selected for downstream analysis. Clustering of the cells was conducted using Seurat's FindNeighbors and FindClusters functions. Clusters were identified using the Louvain algorithm with a resolution parameter of 0.5. Uniform Manifold Approximation and Projection (UMAP) was used to visualize clusters in two-dimensional space.

For most downstream analyses, the two replicates from each cell line were merged using Seurat’s merge function. Normalization was re-applied to the merged dataset, followed by identification of variable features, scaling, PCA, and UMAP for dimensionality reduction. Merging was done both separately for each cell line and for all cell lines together. For visualization purposes, integration of replicates was performed using Seurat’s IntegrateData function. Integration anchors were identified with FindIntegrationAnchors, and integrated datasets were processed similarly to merged datasets.

Cell-cycle phase classification was done using Seurat’s CellCycleScoring function with default S and G2M genes from Seurat.

To dissect perturbation effects of the oncogenes, the distance of the perturbation to the control was quantified using the E-Distance metric and assessed for significance with the E-Test, both implemented in the scperturbR package (v0.1.0) (Peidli et al, 2024). To account for imbalances in oncogene representation, the sample-corrected version of the function was used. In addition, pathway activity of all oncogenes was inferred using the progeny package (v1.26.0) (Schubert et al, 2018), considering the top 500 genes. The resulting scores were scaled using Seurat’s ScaleData function.

To perform differential gene expression analysis, the single cells were pseudobulked using the pseudobulk function from glmGamPoi (v1.16.0) (Ahlmann-Eltze and Huber, 2020), with conditions containing fewer than 20 cells excluded, and conditions with more than 150 cells were downsampled. The resulting SingleCellExperiment objects were used for DE testing with DESeq2 (v1.44.0) (Love et al, 2014). A DESeqDataSet was created for each cell line with a design formula accounting for oncogene and replicate. DE results were filtered with an adjusted P value < 0.01 and log2 fold change > 2 or < −2.

Gene Set Enrichment Analysis (GSEA) (Subramanian et al, 2005) was performed to identify pathway-level changes associated with differentially expressed genes across oncogene conditions. In total, 275 gene sets were obtained from 50 different previously published studies and 7 established databases (Dataset EV2). For each cell line, DE genes from the DESeq2 analysis were ranked by log2 fold change. The ranked gene lists were used for enrichment analysis using the fgseaMultilevel function from the fgsea package with default settings (v1.30.0) (Korotkevich et al, 2016). Enriched pathways were identified based on adjusted P values < 0.05.

Gene Set Variation Analysis (GSVA) was performed using the ssgseaParam and gsva functions from the GSVA package (v1.52.0) (Hänzelmann et al, 2013).

Establishment of the model for phenotypic analysis

Latent space representations of single-cell data were computed using scVI with a linear decoder implemented in the LinearSCVI model from the scvi-tools package (v1.1.6.post2) (Gayoso et al, 2022; Lopez et al, 2018; Svensson et al, 2020). Prior to model learning, we performed feature selection using the highly_variable_genes function in scanpy (v1.10.2) (Wolf et al, 2018) based on log-transformed gene expression data, using line and replicate as batch. Raw expression of highly variable genes and a batch key containing line and replicate were input to the scVI model, which was trained for latent dimensions k ranging from 2 to 15, with 250 epochs for each configuration, conditioning the latent distribution to log-normal (latent_distribution='ln'). For each configuration, latent cell representations and gene loadings were extracted and stored for downstream analysis.

The optimal number of latent dimensions was determined by evaluating the robustness of programs across different ranks. Pairwise Jaccard similarity scores were calculated between the top 100 genes (highest loadings) for each dimension as follows: $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$J\left(A,B\right)=\frac{{|A}\cap {B|}}{{|A}\cup {B|}}$$\end{document}$ , where A and B are the sets of top genes for two latent programs, $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{{\rm{|A}}}}\cap {{{\rm{B|}}}}$$\end{document}$ is the number of shared genes, and $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${{{\rm{|A}}}}\cup {{{\rm{B|}}}}$$\end{document}$ is the total number of unique genes. Robust programs were defined as those with a Jaccard similarity score exceeding the 95th quantile of all pairwise similarity scores across dimensions: $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$T={Q}_{95}{\left(\left\{J\left({A}_{i},{A}_{j}\right)\left|i \, \ne \, j\right.\right\}\right)}^{{{{\rm{th}}}}}$$\end{document}$ , where $\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${Q}_{95}$$\end{document}$ is the 95k quantile of the similarity score distribution. Programs meeting the threshold k were selected as robust. The lowest rank k with the highest number of robust programs was chosen for downstream analyses.

TMO characterization

To assess the relationships between scVI-derived modules, termed transcriptional modules unlocked by oncogenes (TMOs), pairwise Pearson correlation coefficients were computed across TMOs using the gene weight matrix, resulting in a TMO-to-TMO correlation matrix. To identify clusters of functionally related TMOs, hierarchical clustering was performed on the transformed correlation distance.

Characterization of biological functions associated with TMOs was performed with GSEA on the gene weight matrix from the linear decoder. The same gene sets as above were used. For each TMO, genes were ranked by their weight values, and enrichment analysis was conducted using fgseaMultilevel from the fgsea package (Korotkevich et al, 2016). The enrichment results were filtered based on adjusted P values to identify significant pathway associations.

Additional gene signatures were employed for further refinement of TMO characterization: DNA damage and replicative stress signatures (Takahashi et al, 2022; Lange et al, 2011; Liberzon et al, 2015; Meng et al, 2018b; Milacic et al, 2024), gene signatures characterizing TNFα and YAP pathway activity (Ashburner et al, 2000; Milacic et al, 2024; Agrawal et al, 2024; Liberzon et al, 2015; Schaefer et al, 2009; Wang et al, 2018; Gregorieff et al, 2015), a signature for senescence (Saul et al, 2022), a TP53 target gene set (Jeay et al, 2015), and gene signatures for metabolism-associated processes (Milacic et al, 2024; Agrawal et al, 2024; Liberzon et al, 2015; Mootha et al, 2003; Aleksander et al, 2023). All signatures and gene set resources can be found in Dataset EV1.

Pathway activity inference was performed on TMOs using the PROGENy framework (Schubert et al, 2018), similar to the PROGENy analysis conducted for the full dataset. Pathway scores were computed based on the gene weights within each TMO. The scores were then scaled to facilitate comparisons across TMOs.

The threshold for selecting top genes per TMO was determined by calculating the hazard ratio of each TMO in TCGA data (for details, see below) across varying gene set sizes. To refine the analysis, gene sets were filtered to exclude genes classified as immune or stromal (see below) and to retain only those present in the TCGA dataset. The tested gene sets ranged from the top 50–200 genes, and the window of tested genes was shifted by ten genes in each iteration. The optimal threshold was defined as the gene set size yielding the smallest P value. The filtered and unfiltered gene signatures are provided in the Dataset EV2.

Validation datasets

Published RNA-sequencing data of Caco-2 cells with Dox-inducible BRAF, BRAF (V600E) (Klotz-Noack et al, 2020), NRAS (Q61K), NRAS (G12D), and KRAS (G12V) (Kuhn et al, 2021) were pre-processed and log2 transformed where necessary. Samples were analyzed regarding their signature expression of TMO signatures as well as signatures for Inflammatory response, DNA Repair, and Replicative Stress, as well as Epithelial crypt and senescence, as described above, using GSVA’s gsvaParam and gsva function (v1.52.0) (Hänzelmann et al, 2013).

RNA sequence, mutation, and clinical data of the TCGA Colorectal Adenocarcinoma PanCancer Atlas (Hoadley et al, 2018; Ellrott et al, 2018; Taylor et al, 2018; Liu et al, 2018) study were obtained through cBioPortal (Cerami et al, 2012; Gao et al, 2013; de Bruijn et al, 2023). Additional metadata, including microsatellite instability (MSI) status, was obtained via synapse (syn2623706) (Guinney et al, 2015), while iCMS classification data (Joanito et al, 2022) were provided upon request. Further patient-level information was extracted through the TCGAbiolinks package (v2.32.0) (Colaprico et al, 2016). The gene expression data were retrieved as normalized transcript per million (TPM) values using RNA sequencing by expectation maximization (RSEM). Before further analysis, TPM values were log-transformed. To focus on epithelial cell gene expression and minimize the influence of immune and stromal components in biopsy samples, we applied a filtering step using gene lists obtained from (Zenodo ID: 10.5281/zenodo.10692018) (Wei et al, 2024). Genes that were expressed at least two-fold higher in immune or stromal cells compared to epithelial cells were removed from the TCGA dataset. Similarly, TMO signatures were filtered to retain only epithelial genes. TMO expression scores were determined using GSVA’s gsvaParam and gsva function (v1.52.0) (Hänzelmann et al, 2013). Statistical significance of differences of TMO scores across clinical parameters was tested with a Wilcoxon rank-sum test for pairwise comparisons and a Kruskal–Wallis test for multiple conditions. Hazard Ratio analysis was performed using a Cox proportional hazards model via the coxph function from the survival package (v3.7.0) (Therneau, 2021) with the formula: Surv(PFS_MONTHS, PFS_STATUS) ~ TMO_score, where progression-free survival (PFS) was modelled based on TMO expression.

For survival analysis, we used independent validation cohorts previously analyzed by Guinney et al (Guinney et al, 2015), all of which include curated metadata and CMS classifications. Normalized microarray expression data and metadata from 566 patients in the Marisa et al, cohort (Marisa et al, 2013) were obtained from Synapse (syn2019116). Additional cohorts were also accessed via Synapse (syn2181001, syn2178073, and syn2019118) (Sadanandam et al, 2014; Laibe et al, 2012; De Sousa et al, 2013). Microarray IDs were mapped to gene symbols using the AnnotationDbi package (v1.66.0) (Pagè et al, 2024).

To ensure consistency, immune and stromal genes were removed from each dataset as described above, and TMO scores were calculated using GSVA. Cox proportional hazards models were fitted for the four datasets using the formula Surv(rfsMo, rfsStat) ~ TMO_score. For the Marisa cohort, patients were further stratified into high- and low-expression groups for each TMO by using the surv_cutpoint function from survminer (v0.5.0) (Kassambara et al, 2017), which determines optimal cut-off points. These groups were used to produce Kaplan–Meier plots using ggsurvplot. To quantify the relative overlap between groups of patients with high or low TMO scores across two TMOs, we calculated the ratio of the observed number of shared patients to the expected number of shared patients. The remaining three cohorts were combined in a meta-analysis using the meta package (v8.0-2) (Schwarzer, 2007), applying a random-effects model.

Supplementary information

Appendix Table EV1 Peer Review File Data Set EV1 Data Set EV2 Source data Fig. 1 Source data Fig. 5 Expanded View Figures