Unsupervised Machine Learning Reveals Temporal Components of Gene Expression in HeLa Cells Following Release from Cell Cycle Arrest
Tom Maimon, Yaron Trink, Jacob Goldberger, Tomer Kalisky

TL;DR
This study uses unsupervised machine learning to uncover hidden patterns in gene expression over time in HeLa cells after cell cycle arrest.
Contribution
The novel use of unsupervised machine learning to deconvolve gene expression into temporal components reveals dynamic biological processes.
Findings
Two oscillatory gene expression components correspond to G1-S and G2-M cell cycle phases.
A third transient component is linked to early response genes and cervical cancer.
Unsupervised methods can reveal hidden temporal dynamics in biological systems.
Abstract
Gene expression measurements of tissues, tumors, or cell lines taken over multiple time points are valuable for describing dynamic biological phenomena such as the response to growth factors. However, such phenomena typically involve multiple biological processes occurring in parallel, making it difficult to identify and discern their respective contributions at any time point. Here, we demonstrate the use of unsupervised machine learning to deconvolve a series of time-dependent gene expression measurements into its underlying temporal components. We first downloaded publicly available RNAseq data obtained from synchronized HeLa cells at consecutive time points following release from cell cycle arrest. Then, we used Fourier analysis and Topic modeling to reveal three underlying components and their relative contributions at each time point. We identified two temporal components with…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4- —Israel Science Foundation
- —Israel Ministry of Health
- —EU-FP7
- —Data Science Institute at Bar-Ilan University
- —ICRF
- —Israel Ministry of Science
- —Israel Ministry of Justice
- —Israel Cancer Association
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsCell Image Analysis Techniques · Single-cell and spatial transcriptomics · RNA Research and Splicing
1. Introduction
Complex biological systems such as tissues and tumors contain millions of cells. These cells dynamically change their transcriptional state over time, as multiple gene programs—coordinated sets of genes that work together to perform specific biological tasks—are activated and repressed through a complex network of interactions. Dynamic changes in transcriptional cell states underlie fundamental biological processes, such as embryo development, tissue regeneration, and the onset and progression of cancer. A common approach to characterize cell state dynamics is to perform RNA sequencing at multiple time points, either at the ‘bulk’ or single-cell level, yielding a series of gene expression profiles that represent the underlying cell states, and that usually form continuous trajectories in latent space. However, these expression profiles are typically a combination of multiple time-dependent components, where each component is associated with distinct biological processes and gene programs. A major challenge is to “deconvolve” these temporal components. This involves delineating each component, identifying its associated gene programs, and inferring its relative contribution to the overall gene expression profiles at each time point.
Here, we demonstrate the use of unsupervised machine learning to reveal the temporal components in a simple model biological system. We first downloaded a dataset of bulk RNAseq measurements performed by Dominguez et al. on synchronized HeLa cells at 14 consecutive time points—approximately two cell cycles—following release from cell cycle arrest [1,2]. Then, we performed Fourier analysis and found that most genes have a periodicity of either one or two cycles over time. Next, we used topic modeling, an unsupervised machine learning technique, and found that the series of gene expression profiles can be represented as a mixture of three components, two of which are periodic over time and correspond to the G1-S and G2-M phases of the cell cycle, and a third topic with a transient temporal behavior, that is associated with the immediate early response gene program, regulation of cell proliferation, and cervical cancer. This study demonstrates the potential of machine learning algorithms to deconvolve hidden temporal components in complex biological systems, with potential applications for early disease detection, as well as monitoring disease progression and recovery processes.
2. Results
2.1. RNA Velocity Analysis of Periodically Expressed Genes Reveals a Time Lag Between Spliced and Un-Spliced mRNA
We downloaded a published dataset of RNAseq measurements that were performed by Dominguez et al. [1,2]. In that study, HeLa cells were first synchronized by a double thymidine block. Then, following release from cell cycle arrest, RNA was collected and sequenced at 14 consecutive time points which corresponds to approximately two cell cycles (Figure 1A, Table S1). The authors also identified a set of 67 “core” cell cycle related genes with periodic behavior, and categorized them as “G1-S related” or “G2-M related” according to the stage at which their expression is maximal (Table S1).
We started by manually inspecting the expression levels of spliced and un-spliced mRNA of “core” cell cycle genes such as CCNE2 and UNG (G1-S related) and MKI67 and TOP2A (G2-M related, Figure 1B) [3]. As expected, we observed that the yet-unspliced mRNA typically precedes the spliced mRNA by a single sampling interval, which corresponds to a time interval of approximately 1.5–3 h. We confirmed this by performing a cross-correlation between the spliced and un-spliced mRNA levels averaged over the 67 “core” cell cycle genes (Figure 1C).
2.2. Fourier Analysis Identifies Sets of Genes with Potentially Transient and Oscillatory Behaviors over Time
To test the periodicity of every gene in the transcriptome, we performed Fourier analysis. For each gene we plotted its “dominant frequency”, that is, the frequency with the highest spectral density, versus its “dominant frequency score”, which measures the degree to which the score of this frequency is higher than the scores of the other frequencies (see Section 4 and Figure 2A). We found that the first and second dominant frequencies—that correspond to either one or two whole cycles—contain genes with scores that are higher than those within other dominant frequencies (Figure 2A). Moreover, we observed that genes with dominant frequencies corresponding to three cycles and higher have scores comparable to those derived from randomized datasets, which were generated by randomly shuffling the order of counts for each gene (Figure S1). This suggests that most of the information in our dataset is contained in the expression levels of genes with a periodicity of either one or two cycles, indicating potential transient or oscillatory behavior, respectively.
We next selected genes from the second dominant frequency (with scores > 3) and performed PCA. We found that the samples form a circular trajectory in latent space that corresponds to almost two complete cell cycles (Figure 2B). Moreover, separating the spliced and yet-unspliced mRNA expression profiles results in two circular trajectories in latent space with a phase difference between them (Figure S2). When we selected the genes from the first and second dominant frequencies, we found that the samples form a trajectory in latent space with two patterns of behavior: a transient pattern along PC1 (Figure 2C) and an oscillatory pattern along PC2 and PC3 (Figure 2D,E).
2.3. Topic Modeling Reveals Three Temporal Components, Two of Which Are Periodic, Corresponding to the G1-S and G2-M Phases of the Cell Cycle, and a Third, Transient Component, Related to Immediate Early Response, Regulation of Cell Proliferation, and Cervical Cancer
“Topic models” are used in computer science to describe collections of documents that contain varying proportions of words from a predetermined number k of different “topics” [4,5]. By fitting a topic model to a set of documents, it is possible to discover both the k latent topics (i.e., the probability of occurrence of each word from the vocabulary within a topic) and their proportions in each individual document. In our case we assume that each cell state (gene expression profile) at a given time point is a weighted sum of k hidden components whose proportions vary with time. We therefore treat each cell state at a given time point as a document, each gene in the human genome as a word in the vocabulary, the presence of a specific RNA transcript within a cell state as the occurrence of a specific word in a document, the expression level (number of transcripts) of a gene in a cell state as the number of occurrences of a word in a document, and each latent component as a topic. Topic modeling can thus be used to find the k components (i.e., the probability of observing an RNA transcript from each gene within a component) and their proportions at each individual time point.
We fitted a topic model with k = 3 topics to the 14 gene expression profiles and identified three components/topics with distinct temporal patterns, which we labeled as “k1”, “k2” and “k3” (Figure 3A and Table S2). We found that topics k1 and k3 have an oscillatory behavior over the course of both cycles following release from cell cycle arrest and are periodically over-expressed at specific phases of the cell cycle (Figure 3A,B). Topic k2 however is transiently over-expressed mainly during the first cycle following release from cell cycle arrest. We confirmed that the temporal pattern of each topic is also observed in selected genes associated with it. For example, the genes CCNE2 and UNG whose expression is periodic are associated with topic k1 (Figure 1B and Figure 4A), whereas the genes MKI67 and TOP2A whose expression is also periodic, but with a different phase, are associated with topic k3 (Figure 1B and Figure 4A). Likewise, the genes KIFC3, PHLDB2, and PLK2, which are associated with topic k2, are transiently over-expressed following release from cell cycle arrest (Figure 3C and Figure 4A).
To identify and characterize the topics, we performed Gene Ontology (GO) enrichment analysis. For each topic, we selected genes that are significantly over-expressed and used them as input to ToppGene [6] (Figure 4A, Tables S3–S5). We found that topic k1 over-expresses genes related to the G1-S phase of the cells cycle (e.g., DNA replication and G1 to S cell cycle control) and that topic k3 over-expresses genes that are related to the G2-M phase of the cells cycle (e.g., mitosis and regulation of the G2/M transition). In contrast, topic k2 was found to over-express genes associated with the following gene programs: (i) immediate early response [7,8,9,10] (Table S6 and Figure S3); (ii) regulation of cell proliferation (Table S4), and (iii) cervical carcinoma (Table S4).
We confirmed these observations by calculating the posterior probabilities p(k1|gene), p(k2|gene), and p(k3|gene) for genes belonging to specific gene sets and GO terms (Figure 4C). These probabilities represent the association between these genes and each of the three topics. Indeed, we observed that within the 67 “core” cell cycle genes, the G1-S related genes were associated with topic k1 and the G2-M related genes were associated with topic k3. Likewise, genes related to immediate early response and cervical carcinoma were more strongly associated with topic k2.
3. Discussion
In this study we applied Fourier analysis and Topic modeling to a simple in vitro model system to identify its latent components. We identified two components with a periodic temporal behavior (topics k1 and k3), that correspond to the G1-S and G2-M phases of the cell cycle, and a third component with a transient expression pattern (topic k2) that is associated with the immediate early response gene program, regulation of cell proliferation, and cervical cancer.
Specific examples of genes that are transiently over-expressed following release from cell cycle arrest, and that have a high probability for being expressed in topic k2, are KIFC3, PHLDB2, and PLK2 (Figure 3C and Figure 4A). These genes are known to be associated with cervical carcinoma (Table S4). Moreover, PLK2, a member of the polo-like kinase family, is also known to be associated with positive regulation of the cell cycle [11] and is regarded as an immediate early response gene [10] (Figure S17). Other examples include the genes JUN (C-Jun), FOS (c-Fos), FOSB, and FOSL1 (FRA-1), that are also transiently over-expressed following release from cell cycle arrest (Figure S13). The protein products of these genes are components of the AP-1 transcription factor complex that is thought to be involved in cell proliferation and cancer progression [12,13]. Furthermore, the genes JUN, FOS, and FOSB are also members of the immediate early response gene family. The genes RELB, NFKB1, and NFKB2, are also transiently over-expressed following release from cell cycle arrest (Figure S18). These genes are members of the NF-kB family of transcription factors and are known to induce target genes involved in initiation and progression of cancer [14,15]. We also observed that the apoptosis inhibitors BIRC2 and BIRC3 [16] are transiently over-expressed following release from cell cycle arrest (Figure S27), whereas the gene BIRC5 (Survivin), which is a regulator of the mitotic cell cycle, shows a periodic pattern of expression. Additional examples are shown in Figures S4–S27.
The association of immediate early response genes with topic k2 is not very surprising, since these genes are known to be rapidly and transiently transcribed in response to stimulation of cells by serum or growth factors such as epidermal growth factor (EGF), subsequently promoting phenotypic changes such as proliferation, differentiation, and survival [10]. In our study, it is likely that these genes facilitate cell-cycle reentry, promoting the transition from the quiescent G0 phase into the G1 phase. The association of topic k2 with cervical cancer is less straightforward. One possible explanation for this association is that, following release from cell cycle arrest, the cells enter a transient state lasting approximately twelve hours (samples D1–D7), during which they activate molecular mechanisms to resume the cell cycle. In normal differentiation, cell cycle exit is associated with lineage commitment and is typically regulated by tissue-specific mechanisms [17]. It is plausible that cell cycle reentry in HeLa cells is similarly regulated by mechanisms specific to the cervical cancer from which they were derived.
For some of the 67 “core” cell cycle genes we observed that the spliced and unspliced mRNA levels arrive at their first peak (either maximum or minimum) at the same time (Figure 1B), and only after a few hours the typical time lag between the spliced and un-spliced mRNA expression is acquired. This indicates that initially, during the few hours after release from cell cycle arrest, the spliced and un-spliced transcripts are synchronized in these genes. Additional evidence for this gradual acquisition of time lag between spliced and un-spliced mRNA can be seen for larger sets of genes with periodic behavior (Figure S2A,B). This indicates that in the transient cell state following release from cell cycle arrest, the transcriptional timing mechanisms operate in a way that is different from the normal cycling state. One possible explanation is that during this period, the primary mRNA transcribed from these genes is spliced, processed, and translated with minimal delay in order to rebuild the necessary mechanisms for cell cycle re-entry as soon as possible. According to this hypothesis, splicing mechanisms participate in the immediate early response mechanism. Further studies at higher time resolution are required to systematically test this hypothesis and inspect the relationship between mRNA splicing and the immediate early response gene program.
4. Materials and Methods
4.1. Datasets and Preprocessing
The RNAseq dataset generated by Dominguez et al. [1,2] was downloaded from GEO (accession number GSE81485) using SRA tools (version 2.9.6). STAR [18] (version 2.7.3a) was used to align the reads to the human reference genome (hg38) and to obtain the raw counts matrix. DESeq2 [19] (version 1.36.0) was used to obtain the normalized counts matrix. No further batch-effect correction was applied since all samples were generated within the same experiment and laboratory, which minimized technical variability. Since low-count data is often disproportionately affected by technical noise (e.g., mapping errors) and is generally less suitable for correlation analysis, we removed genes whose total normalized counts across all 14 samples were below 10.
4.2. RNA Velocity
Velocyto [3] (https://velocyto.org/, accessed on 8 June 2018), a package for RNA velocity analysis, was used to obtain the read counts of the spliced and unspliced mRNA for each sample. The package was used to annotate and count the STAR-aligned reads. Broadly speaking, each read was annotated as “spliced” if it mapped only to exonic regions of known transcripts, or “unspliced” otherwise, i.e., if it mapped to an exon-intron boundary or entirely to an intron.
Due to processing and memory constraints, we diluted the reads to 20% using Samtools (version 1.19.2) before running Velocyto. Since read dilution is random, this should not affect genes with high expression levels, such as most cell–cycle-related genes analyzed in this study. The specific command line used to run Velocyto was: “velocyto run_smartseq2-o./velocyto_out/-e my_analysis./bam_files_diluted_0.2/* genes.gtf”. This was our sole use of the Velocyto package, and we did not employ any of the downstream R-based modeling tools (e.g., for estimation of the time derivative of gene expression).
Note that, since we are dealing with bulk RNA-seq rather than single-cell data, this analysis is only applicable to gene sets whose expression is assumed to be synchronized across all cells in the sample. This assumption is reasonable in our case, as the cells were synchronized using a double thymidine block. Our analysis is similar to that of La Manno et al. [3], who analyzed bulk RNA-seq time-course data from mouse liver during the circadian cycle and observed that unspliced mRNA levels of circadian-associated genes at each time point resembled the spliced mRNA levels at the subsequent time point (Figure 1e in La Manno et al. [3]).
DESeq2 normalization was performed for the spliced and unspliced counts matrixes combined.
4.3. Fourier Transform
For each gene, a Fourier analysis was performed using the R function “periodogram()” from the “TSA” (Time Series Analysis) R package (version 1.3.1). This function returns a vector of frequencies and a vector of scores (spectral densities) for each frequency. For each gene we found the “dominant frequency”, that is, the frequency with the highest score, and calculated a “dominant frequency score” which measures the degree to which the score of this frequency is higher than the scores of the other frequencies. Specifically, if for a particular gene the highest score is S1 and its matching frequency is F1, the second highest score is S2 and its frequency is F2, the third ranking score is S3 and its frequency is F3 etc., then the “dominant frequency” for that gene is F1, and the “dominant frequency score” is: . The significance of the dominant frequency scores was tested by comparing them to scores derived from randomized datasets, which were obtained by randomly shuffling the order of counts for each gene.
4.4. Topic Modeling
Topic modeling was performed on the normalized counts matrix using the R functions fit_topic_model(), diff_count_analysis(), and structure_plot() from the “fastTopics” R package [20] (version 0.6–135). We fitted a topic model with k = 3 topics that we labeled as “k1”, “k2”, and “k3”. Using these functions, we calculated the three probability distributions , , and for every gene in each one of the three topics, as well as the probabilities , , and for each topic in every one of the fourteen samples. We also tested a higher number of topics, but the results did not change significantly.
One way to visualize the association between a specific gene and each of the three topics is to calculate the “posterior probabilities” , , and and compare them. In order to do this, we used Bayes’ theorem:
where for simplicity, we set the prior for each topic.
4.5. Gene Ontology (GO) and Gene Set Enrichment Analysis
GO enrichment analysis was performed with Toppgene [21] (https://toppgene.cchmc.org/, last accessed on 8 September 2024).
Gene set enrichment analysis was performed with GSEA [22] (https://www.gsea-msigdb.org/gsea, last accessed on 11 June 2024). Genes were ranked in descending order according to the differences in their posterior probabilities between topic k2 and topics k1 and k3, using the option “Phenotype labels: Create on-the-fly phenotype by sample names”. The “Gene sets database” was defined as the set of immediate early genes (Table S6), the “Number of permutations” was set to 1000, the “Permutation type” was set to “gene_set”, and the “Metric for ranking genes” was set to “Diff_of_classes” (see Figure S3 for run details).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Dominguez D. Tsai Y.H. Weatheritt R. Wang Y. Blencowe B.J. Wang Z. An Extensive Program of Periodic Alternative Splicing Linked to Cell Cycle Progressione Life 20165 e 1028810.7554/e Life.1028827015110 PMC 4884079 · doi ↗ · pubmed ↗
- 2Dominguez D. Tsai Y.H. Gomez N. Jha D.K. Davis I. Wang Z. A High-Resolution Transcriptome Map of Cell Cycle Reveals Novel Connections between Periodic Genes and Cancer Cell Res.20162694696210.1038/cr.2016.8427364684 PMC 4973334 · doi ↗ · pubmed ↗
- 3La Manno G. Soldatov R. Zeisel A. Braun E. Hochgerner H. Petukhov V. Lidschreiber K. Kastriti M.E. Lönnerberg P. Furlan A. RNA Velocity of Single Cells Nature 201856049449810.1038/s 41586-018-0414-630089906 PMC 6130801 · doi ↗ · pubmed ↗
- 4Dey K.K. Hsiao C.J. Stephens M. Visualizing the Structure of RNA-Seq Expression Data Using Grade of Membership Models P Lo S Genet.201713 e 1006599 Correction in P Lo S Genet. 2017,13, e 100675910.1371/journal.pgen.100659928333934 PMC 5363805 · doi ↗ · pubmed ↗
- 5Blei D.M. Ng A.Y. Jordan M.I. Latent Dirichlet Allocation J. Mach. Learn. Res.200339931022
- 6Chen J. Bardes E.E. Aronow B.J. Jegga A.G. Topp Gene Suite for Gene List Enrichment Analysis and Candidate Gene Prioritization Nucleic Acids Res.200937(Suppl. S 2)W 305W 31110.1093/nar/gkp 42719465376 PMC 2703978 · doi ↗ · pubmed ↗
- 7Herschman H.R. Primary Response Genes Induced by Growth Factors and Tumor Promoters Annu. Rev. Biochem.19916028131910.1146/annurev.bi.60.070191.0014331883198 · doi ↗ · pubmed ↗
- 8O’Donnell A. Odrowaz Z. Sharrocks A.D. Immediate early Gene Activation by the MAPK Pathways: What Do and Don’t We Know?Biochem. Soc. Trans.201240586610.1042/BST 2011063622260666 · doi ↗ · pubmed ↗
