Predicting Growth Rate from Gene Expression

Thomas P. Wytock; Adilson E. Motter

arXiv:1901.05010·q-bio.CB·January 23, 2019

Predicting Growth Rate from Gene Expression

Thomas P. Wytock, Adilson E. Motter

PDF

1 Repo

TL;DR

This study develops a machine learning model that predicts microbial growth rates from gene expression data, achieving high accuracy and reducing feature complexity, which can guide experimental optimization.

Contribution

It introduces a novel approach using k-nearest-neighbors regression to accurately predict growth rates from gene expression with significant feature reduction.

Findings

01

Predicts E. coli growth rate with 81% variance explained

02

Predicts S. cerevisiae growth rate with 89% variance explained

03

Reduces gene expression features from thousands to fewer than twenty

Abstract

Growth rate is one of the most important and most complex phenotypic characteristics of unicellular microorganisms, which determines the genetic mutations that dominate at the population level, and ultimately whether the population will survive. Translating changes at the genetic level to their growth rate consequences remains a subject of intense interest, since such a mapping could rationally direct experiments to optimize antibiotic efficacy or bioreactor productivity. In this paper, we directly map transcriptional profiles to growth rates by gathering published gene-expression data from Escherichia coli and Saccharomyces cerevisiae with corresponding growth-rate measurements. Using a machine-learning technique called k-nearest-neighbors regression, we build a model which predicts growth rate from gene expression. By exploiting the correlated nature of gene expression and sparsifying…

Tables1

Table 1. Table S1: Correlation between the selected eigengenes in E. coli and growth rate. The columns are: the eigenvalue of the selected eigengene, the corresponding rank by the absolute value of the Spearman correlation between selected eigengenes and growth rate (among all eigengenes), and the value of correlation coefficient. Lower rank numbers correspond to larger correlation (in absolute value). “Including Noise” lists the eigenvalues selected when noise is considered during the feature selection.

Selected Eigenvalue	Rank	Spearman Correlation
0.053516	1829	-0.040
0.052508	990	0.157
0.060603	310	-0.264
0.060767	2125	0.002
0.159468	1994	-0.019
0.077524	1287	-0.113
0.095336	1355	0.105
0.098494	768	-0.188
0.079097	542	-0.222
(Including Noise)
32.221929	50	0.354
3.039641	1896	-0.032
3.114244	1519	0.083
1.488373	1081	-0.143
3.386905	1963	0.023

Equations10

C = PD P^{- 1},

C = PD P^{- 1},

P^{- 1} = P^{T} .

P^{- 1} = P^{T} .

ν^{i} = P^{T} v^{i} .

ν^{i} = P^{T} v^{i} .

F (ν, g; ∣ S ∣) = S arg min i \sum E (G^{(S)} (ν^{i}) - g^{i})^{2} + λ (lo g \frac{N _{p}^{2}}{N _{g} N _{ν}})^{2} .

F (ν, g; ∣ S ∣) = S arg min i \sum E (G^{(S)} (ν^{i}) - g^{i})^{2} + λ (lo g \frac{N _{p}^{2}}{N _{g} N _{ν}})^{2} .

S^{l} = A^{l} a_{l} .

S^{l} = A^{l} a_{l} .

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Repositories

twytock/MI-POGUE
noneOfficial

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

Predicting Growth Rate from Gene Expression

Thomas P. Wytock

Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, USA

Adilson E. Motter

Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, USA

Northwestern Institute on Complex Systems, Evanston, IL 60208, USA

Physical Sciences-Oncology Center, Northwestern University, Evanston, Illinois 60208, USA

Abstract

Growth rate is one of the most important and most complex phenotypic characteristics of unicellular microorganisms, which determines the genetic mutations that dominate at the population level, and ultimately whether the population will survive. Translating changes at the genetic level to their growth rate consequences remains a subject of intense interest, since such a mapping could rationally direct experiments to optimize antibiotic efficacy or bioreactor productivity. In this paper, we directly map transcriptional profiles to growth rates by gathering published gene-expression data from Escherichia coli and Saccharomyces cerevisiae with corresponding growth-rate measurements. Using a machine-learning technique called $k$ -nearest-neighbors regression, we build a model which predicts growth rate from gene expression. By exploiting the correlated nature of gene expression and sparsifying the model, we capture 81% of the variance in growth rate of the E. coli dataset while reducing the number of features from over 4,000 to nine. In S. cerevisiae, we account for 89% of the variance in growth rate while reducing from over 5,500 dimensions to 18. Such a model provides a basis for selecting successful strategies from among the combinatorial number of experimental possibilities when attempting to optimize complex phenotypic traits like growth rate.

Wytock, T. P., & Motter, A. E. (2019). Predicting growth rate from gene expression. Proc. Natl. Acad. Sci. USA, 116(2), 367–372

keywords:

biological networks $|$ machine learning $|$ systems biology $|$ metabolic networks $|$ data science

Significance summary

Connecting genetic changes to organismal function has been a central problem of biology for decades. Understanding the genetic underpinnings of functional traits like growth rate remains incomplete despite efforts to uncover metabolic and gene regulatory networks. Here, we leverage correlations derived from large-scale datasets of E. coli and S. cerevisiae to construct a mapping between gene expression and growth using the $k$ -nearest neighbors technique. Our mapping can predict growth rate more accurately than previous methods, while compressing gene-expression data from thousands of genes to tens of features without requiring network structure identification. This model can be applied to generate hypotheses, design experiments, and reduce the amount of trial and error in research.

Introduction

Mapping genotype to phenotype remains a central challenge in molecular biology. In the past two decades, complex networks have emerged as a tool to organize the vast amount of data generated by genomic technologies toward mapping biochemical patterns to whole-system function. Applications include the meta-analysis of genetic interactions across organisms to find biologically conserved structures Ravasz2002 , the network analysis of gene annotation relationships to interpret the expression changes in gene sets Glass2012 ; Ku2012 , the curation of genetic relationships into gene-regulatory networks to predict the outcome of proposed interventions Zanudo2015 ; Cornelius2013 , and the reconstruction of metabolic networks on which metabolic capacity is calculated through constraint-based models (CBMs) Edwards2000 ; Forster2003a ; Segre2002 . These tools have also been used to explain gene essentiality and epistasis Forster2003b ; Gerdes2003 ; Segre2005a , and to study disease progression and treatment Zanudo2015 ; Schlauch2017 .

Though successful, these strategies all rely on the labor-intensive task of determining the network of relationships between genes. They resolve the network structure through a combination of aggregation of prior knowledge and targeted experimentation, but the benefit that network models provide by structuring data are limited by the problems they address or the conditions under which they apply. For example, annotation methods are associative and qualitative, which limit the potential for causal attribution and inter-study comparison, respectively. Meanwhile, precise dynamic models of gene-regulatory networks often require the (challenging) measurement of in vivo kinetic parameters or other condition-specific quantities to validate the dynamical rules. On the other hand, CBMs require experiments with well-defined media and measurements of metabolic uptake rates for flux-balance analysis (FBA) to yield accurate maximal rates of biomass production. In addition, FBA assumes that the cell directs its metabolic activities to maximize cell growth and is fully adapted to its environment both before and after a perturbation Edwards2000 ; Fong2004 , although alternate methods have been developed that relax this restrictive requirement on the final state Segre2002 ; Shlomi2005 or more generally Mahadevan2003 .

In this paper, we establish a complementary method to predict growth rate using only gene-expression data, which we refer to as Model-Independent Prediction Of Growth Using Expression (MI-POGUE). Even though our method focuses on growth rate, it provides a novel strategy to answer the more general questions of how whole-cell gene expression affects phenotypic changes and thus of how to convert genome-wide observations into quantitative phenotypic predictions. The novelty and flexibility of MI-POGUE derive from using an effective model of genetic interactions in lieu of relying on prior knowledge or specialized experiments.

We develop MI-POGUE by retrieving large datasets of gene expression and growth rate in E. coli Carrera2014 and S. cerevisiae Hughes2000 ; Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 ; Kemmeren2014 . Comprising thousands of individual observations, these datasets allow the direct measurement of gene-gene correlations present in cells, which form the basis of an effective model of genetic regulation. In our approach, we transform the gene-expression data into weighted combinations of genes derived from the gene-gene correlations called “eigengenes” Alter2000 , and predict growth rate by averaging the growth rates associated with the gene-expression profiles most similar to a given target profile—a technique known as $k$ -nearest neighbors (KNN) regression Altman1992 . The efficacy of MI-POGUE is substantiated by comparing it with state-of-the-art methods for predicting growth rate.

The data-driven conception of MI-POGUE sidesteps the network identification problem while still accounting for all observed changes to intracellular networks in response to perturbations. Given that nonlinearity allows small changes in part of the cell to effect large changes in another part, broad-based strategies like MI-POGUE that account for changes across the whole genome promise to open new lines of inquiry in investigating fundamental systems biology, as well as in engineering microorganisms, and designing antibiotics.

Results

Dataset overview

MI-POGUE requires a number of paired gene-expression and growth-rate measurements large enough to form a representative sample of potential organismal growth conditions to provide accurate growth estimates. We apply MI-POGUE to both E. coli and S. cerevisiae. Table 1 establishes that the size of the dataset we consider is unusually large for each organism.

The E. coli dataset is derived from ref. Carrera2014 and includes a broad sample of environmental conditions, measuring the effects of heat shock, hypoxia, or adaptive evolution on a variety of carbon sources in addition to over 150 genetic perturbations. The primary substrains of E. coli K12 featured in these experiments are MG1655 and BW25113. The S. cerevisiae dataset, which serves to demonstrate its applicability to eukaryotes, comprises experiments performed in chemostats with various environmental stresses and nutrient limitations with gene-expression data taken from refs. Hughes2000 ; Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 ; Kemmeren2014 and growth rate taken from refs. Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 . The metadata annotating the experiments is curated from these references, as described in the Methods, and they are provided with MI-POGUE’s source code Wytock2018 .

Eigengene estimation

The expression between genes is highly correlated Alter2000 , implying that each gene’s expression depends on its neighbors in the network. Here, we derive eigengenes, which are combinations of genes that reorganize expression according to interdependencies implicitly mediated by the gene regulatory network. Let $g^{i}$ and $\vec{v}^{\,i}=\{v_{j}\}^{i}$ be the growth rate and gene expression, respectively, of the $i^{th}$ experiment, and let $j$ be an index over genes, and let $\vec{g}=\{g^{i}\}\ i=1..E$ and $\vec{v}=\{\vec{v}^{\,i}\}\ i=1..E$ be the set of all growth-rate measurements and their associated gene-expression profiles, respectively. Furthermore, let $E$ be the total number of experiments including a growth-rate measurement, $N$ be the number of gene-expression measurements used to estimate correlations, and $J$ be the total number of genes common to each measurement.

We estimate the correlations between genes based on all the available expression data (regardless of whether it had associated growth rate or not) and calculate the eigenvectors. Briefly, we compute the gene-gene (Pearson) correlation matrix $\mathbf{C}=C_{jk}$ where $C_{jk}$ is the Pearson correlation coefficient between the expression of the $j^{th}$ and $k^{th}$ genes. The correlation matrix is square ( $J$ by $J$ ) and symmetric under exchange of indices. We diagonalize the matrix,

[TABLE]

resulting in the matrix $\mathbf{P}=P_{lj}$ in which each column $\mathbf{P}^{l}$ corresponds to the $l^{th}$ eigenvector while each row $\mathbf{P}_{j}$ reflects the $j^{th}$ gene’s projection onto the set of eigenvectors. The diagonal matrix $\mathbf{D}=(D_{ll})$ indicates the amount of correlation occurring along the $l^{th}$ column of $\mathbf{P}$ . Because the correlation matrix is symmetric, we have

[TABLE]

Any expression profile may be projected onto the correlation eigenvectors by matrix multiplication:

[TABLE]

We call the $\vec{\nu}^{\,i}$ eigengenes, a portmanteau of “eigen” (proper) and “gene” Alter2000 , because they represent independent (that is, non-redundant) variations in gene-expression space. Therefore, increasing or decreasing the magnitude of one eigengene’s expression ( $\nu_{l}^{i}$ ), leaves the other projections unchanged ( $\nu_{m}^{i}\,,m\neq l$ ). In contrast, changing the expression of the $j^{th}$ gene, $v_{j}$ , would result in changes in other genes, modulated by $\mathbf{P}_{j}$ , allowing this change to have wide-ranging impacts across the gene-expression profile.

Restriction of KNN models to the most informative eigengenes

The method of KNN regression is a machine-learning technique trained on a set of paired measurements of independent and dependent variables that assigns an average of selected dependent variables to a test measurement of independent variables, where the dependent variables are selected by testing independent variables’ similarity with the training measurements. Here, the dependent variable is growth rate and the independent variables are the elements of a gene-expression profile as illustrated in Fig. 1A. The output of the KNN-fitting process is called a regressor, which takes gene expression as input and outputs an estimate of growth rate. From the set of all calculated eigengenes, we restrict to those with greatest potential to inform growth rate by searching for eigengenes that vary (possibly non-linearly) the most as growth rate changes. We discretize each measurement into bins of both growth rate and gene expression, and search for eigengenes that most evenly distribute the experiments into bins and thus span the range of variation observed.

Figure 1B illustrates in grids the joint distributions between growth rate and gene expression for two eigengenes. In this example, we suppose that each eigengene (and growth rate) can be in one of three states and discretize expression into these bins. Bins with higher densities of observations have darker colors. Any single eigengene places limited constraints on the possible values of growth rate, but by adding more eigengenes the growth-rate possibilities for a given eigengene-expression profile narrow. The increased specificity comes at the cost of increasing the number of possible states an experiment could occupy, thereby increasing the sensitivity to noise (as diagrammed in Fig. 1C). Models that incorporate enough eigengenes to estimate growth rate but avoid overfitting maximize their ability to accurately explain the observed data and while retaining the ability to predict new data.

Optimization of KNN regressors

To select features that predict growth, we require an objective function that balances the explanatory and predictive capabilities of the regressor, called $G$ . We first quantify the explanatory capabilities of $G$ , which is characterized by the set of features $S$ that define the gene-expression subspace in which neighboring experiments are determined. The argument of $G^{(S)}$ is an experiment, $\vec{\nu}^{\,i}$ , where as before $i$ is an index over experiments. We then determine the accuracy of $G^{(S)}(\vec{\nu}^{\,i})$ using the squared difference with the experimentally measured growth rate $g^{i}$ .

As the number of eigengenes $|S|$ in $S$ increases, the predictions converge toward the measured growth rate, but the rate of convergence slows as models incorporate $|S|>10$ eigengenes. Therefore, we introduce a criterion to quantify how efficient $G^{(S)}$ is in terms of state space. This term measures whether the decrease in error is large enough to justify the addition of another eigengene. In a maximally efficient model, each unique combination of eigengene-expression levels would have a corresponding range of growth rates, with no expression combination excluded. In other words, the number of bins occupied by the experiments in the dataset ( $N_{p}$ ) would be equal to the number of growth rate bins in Fig. 1B ( $N_{g}$ ). At the same time, $N_{p}$ would also be equal to the total possible number of configurations $N_{\nu}=\prod_{1}^{|S|}N_{j}$ , where $N_{j}$ is the number of bins for the $j^{th}$ eigengene. We take the square of the (natural) logarithm of each ratio to obtain $\log[N_{p}^{2}/(N_{g}N_{\nu})]^{2}$ , which we refer to as the state-space occupancy. Finally, we introduce the regularization parameter $\lambda$ to balance the relative contribution of the explanatory and predictive terms, yielding:

[TABLE]

The value of $\lambda$ at the optimal value of $|S|$ is case-dependent and empirically found to be near $0.05$ for E. coli (Methods and Fig. S1) and $0.001$ for S. cerevisiae. Asymptotically, the optimal $|S|$ shifts toward smaller numbers as $\lambda\rightarrow\infty$ and toward larger numbers as $\lambda\rightarrow 0$ .

We compare the performance of the various $G^{(S)}$ with Eq. (4) by dividing the dataset into training and test data consisting of gene-expression profiles paired with growth rate. We choose to employ “stratified, five-fold cross-validation” which divides the existing data into subsets, called “folds,” whose distribution of growth rates is constrained to match the distribution of the entire dataset as closely as possible. In testing the generalizability of the regressor, the dataset comprises five equally sized folds, and four folds are used as training data to fit the regressor, which is tested on the fifth. Cross-validation is repeated with each fold used as test data once, yielding predictions for each of the $E$ experiments in the dataset. To account for variability in predictions due to fold construction, we average the predictions over 100 divisions of the dataset.

Finding the optimal set of eigengenes requires the testing of each possible set $S$ , the number of which grows combinatorially. In view of the huge number of possibilities, we employ the “forward-selection” heuristic, which builds the set $S$ by adding eigengenes one at a time to find a set that is close to optimal efron2004least . Starting with $|S|=1$ , and continuing for each size of $|S|$ , we rank all candidate regressors by the cross-validation procedure described in the previous paragraph. Next, we take the top-ranked set of eigengenes and form candidate sets of features of size $|S|+1$ by adding each of the remaining features to the top-ranked set of eigengenes. The cross-validation process continues until $|S|=50$ , at which point the root mean squared error (RMSE) of the predictions has stopped improving.

With the process for evaluating regressors in mind, we detail the optimal regressors for each value of $|S|$ in Fig. 2. Strikingly, the models including 9 and 18 eigengenes out of thousands achieve a better balance of accuracy and predictability than by including much larger gene-based models in both E. coli and S. cerevisiae (Table 2). The small number of features at the minimum compresses the gene-expression information relevant to growth rate into a relatively low-dimensional subspace thereby facilitating further analysis.

For the best regressor composed of each number of eigengenes, we examine the trend in the square root of the sum of squared errors (SSE) (Fig. S2A)) and the state-space occupancy (Fig. S2B) corresponding to the first and second terms of Eq. (4), respectively. The SSE trend in Fig. S2A shows the stagnating improvements in accuracy, despite the geometric decrease in the state-space occupancy by observations in Fig. S2B. The large unoccupied fraction of state space hampers predictability as it is unclear how to extrapolate to hypothetical observations in this region.

Prediction comparison with existing methods

In Table 2, we compare the quality of predictions based on eigengenes with those based on the precursors of biomass—that is, all the genes included in the metabolic reconstruction for each organism. We obtained the precursors of biomass from iJO1366 Orth2011 , containing 1,352 genes for E. coli, and from Yeast 7 Aung2013 , containing 897 genes for S. cerevisiae, and applied MI-POGUE to predict growth rate based on these genes’ expression only. For both organisms, models built on eigengenes have a higher coefficient of determination ( $R^{2}$ ) and lower RMSE than those built on precursors of biomass, despite requiring fewer features.

Given the overall growth-rate prediction accuracy in both organisms, we investigate the accuracy at the level of the individual experiments in Fig. 3A. MI-POGUE performs comparably across most of the strains present in our dataset. In E. coli, 348 of 589 of the predicted experiments fall inside the 5% error (grey region), despite systematic underestimation of the fastest and overestimation of the slowest growth rates. S. cerevisiae shows a similar pattern, with a slightly higher fraction of experiments with less than 5% error (66 of 107). The KNN approach systematically overestimates the slowest growth rates and underestimates the fastest growth rates because, by construction, the nearest neighbors of the slowest growth state will have faster-growth neighbors and vice-versa.

Nevertheless, the level of accuracy motivated us to compare MI-POGUE with other existing methods using $R^{2}$ as a metric in Table 3. MI-POGUE achieves agreement with experimental measurements superior to other reported methods without need for the additional step of converting the value of the biomass objective function to growth rate. MI-POGUE’s accuracy outpaces that of CBMs without requiring metabolic uptake rates or enzyme kinetic parameters.

We demonstrate the flexibility of our method by applying it to predict the growth rate of S. cerevisiae grown in chemostats (Fig. 3B). We also predicted growth using the linear models from ref. Airoldi2009 for comparison (SI Methods). With $R^{2}=0.89$ , MI-POGUE far outpaces the moderate success of the linear growth model for S. cerevisiae Airoldi2009 , which has $R^{2}=0.30$ . Compared to E. coli, MI-POGUE performs slightly better in S. cerevisiae, despite the additional complexity of mapping gene expression to function Ku2012 . This improvement in performance may be attributed to the relative sizes of the datasets and diversity of the conditions considered.

The success of MI-POGUE compared to the other methods derives, in part, from using eigengenes instead of single genes, but at first one might think that this choice obscures the biological role of eigengenes. We note, however, that Weighted Gene Coexpression Network Analysis (WGCNA) Langfelder2008 has recently been employed in fungi to associate gene modules with qualitative growth states Baltussen2018 . Our approach is quantitative rather than qualitative but we can borrow this tool to interpret our eigengenes. Specifically, we develop a method to associate biological functions with eigengenes in E. coli using WGCNA and Protein Analysis Through Evolutionary Relationships (PANTHER) Mi2017 . For each selected eigengene, we take the outer product of the eigengene with itself, resulting in a $J$ by $J$ similarity matrix that is rescaled such that the largest diagonal element is one (see Methods for full details). The rescaled matrix is subjected to WGCNA, yielding a module of genes associated with each eigenvector. We use PANTHER to identify the most over- and underrepresented Gene Ontology (GO) Biological Process annotations in each module (defined by $p<0.05$ , Fisher’s Exact Test).

We find that the top GO terms associated with modules derived from five of the nine eigengenes are overrepresented for polysaccharide, phospholipid, lipid, fatty acid, and amino acid metabolism. In two others, DNA repair and DNA metabolism are overrepresented. Of the remaining two, one is defined by its underrepresentation of metabolic genes, while the other has no terms meeting the $P$ -value threshold—reflecting the sometimes weak association between eigengenes and GO terms. That the nine selected eigengenes have largely nonoverlapping annotations is a result of the forward-selection process. Once an eigengene that captures one biological process is selected, it is less likely that a second eigengene capturing the same process will be selected. The full gene lists and annotation lists are reported in Tables S2 and S3.

Discussion

MI-POGUE both addresses the challenges faced by previous methods and reduces the labor necessary to construct models that map gene expression to phenotype. It fully incorporates gene expression, relaxes the requirement that organisms be completely adapted to their environment, reduces reliance on metabolic uptake rates, and avoids the necessity of estimating enzyme kinetic parameters Teusink2000 ; VanEunen2012 ; Chubukov2014 . Furthermore, MI-POGUE can be applied broadly, even in cases where the growth media are not strictly defined. Because MI-POGUE is flexible, it can repurpose previous measurements without requiring extensive targeted experiments to determine the structure of intracellular networks. Its ability to reduce the relevant features to a small number of eigengenes allows for genome-wide data to be expressed succinctly without loss of predictive power. These advantages are achieved while simultaneously improving the capacity to predict growth rate.

MI-POGUE can characterize the independence, synergy, or antagonism of perturbation pairs by evaluating the growth rate of a hypothetical transcriptional state constructed by adding the (experimentally measured) transcriptional responses of two perturbations to a reference state. Whereas local models leave open the possibility that some unobserved gene accounts for growth deviations from independence, our method implies that, within limitations of the available data, such deviations result from non-genetic mechanisms PenalverBernabe2016 .

The approach of generating proposed states based on experimentally measured gene-expression responses to perturbations can also pre-screen experimental hypotheses. Such screening has the advantage of accounting for the real response of cells as opposed to the simulated response based on network structure Alter2000 ; Brauer2008 . In the case of two perturbations where both are genetic, MI-POGUE can be used to estimate the growth of expression profiles resulting from both individual perturbations and the double perturbation, yielding a computational prediction of growth-rate epistasis, thereby providing a new tool to understand it Segre2005a . MI-POGUE could also be integrated with transcriptional regulatory network from databases Gama-Castro2015 to predict the impact of previously unmeasured gene perturbations.

Researchers can incorporate MI-POGUE with existing strategies that interpret genetic networks in order to improve their effectiveness, as we demonstrate by using WGCNA and PANTHER to interpret the biological roles of eigengenes. Genomic footprinting Gerdes2003 , previously used to find essential genes, could be used to resolve regions of gene expression that yield zero growth, which can enhance the ability of the KNN algorithm to extrapolate beyond the training data.

The applicability of MI-POGUE to metabolic engineering, antibiotic development, and systems biology is expected to encourage its adoption and further refinement or the adoption of similar methods. For example, metabolic engineers could tailor MI-POGUE to offer predictions of a key uptake or secretion rate based on the organism’s gene expression. Antibiotic developers could use transcriptional changes in response to drugs to choose combinations that result in the slowest growth rate as predicted by MI-POGUE. Systems biologists could use MI-POGUE to look for interactions between genes by taking transcriptional responses to single knockouts, adding them, and simulating the outcome. We also note that the mapping of eigengenes to biological functions as described here merit further investigation. Using modern community-detection algorithms, such as weighted stochastic block models Aicher2014 , can help discern finer-scale structure of gene modules

Given that cells are complex systems, weak and indirect interactions at the molecular level can influence behavior at the whole-cell level. Reductionist strategies are poorly suited to study these phenomena, but approaches like MI-POGUE, which combine machine learning with bioinformatic “big data,” have the potential to capture these subtle effects. As systems biologists adapt machine-learning techniques to better interpret high-throughput data, the new interpretative power of these techniques has the potential to reveal under-appreciated and sometimes counter-intuitive effects that will drive the field into the future.

Methods

Implementation of MI-POGUE

We used an implementation of the nearest-neighbors algorithm Altman1992 found in the Python sklearn package Pedregosa2011 . For E. coli (S. cerevisiae) models, we chose $k=7$ ( $k=8$ ) to be the number of neighbors as this number was shown to perform better than other choices (SI Appendix, Figs. S3 and S4). Additional details about the development and extensions of MI-POGUE are described in the SI Appendix, with the code and instructions for running MI-POGUE is available in ref. Wytock2018 .

Discretization of data

The growth-rate bins are fixed so that the number of experiments in each bin is approximately the same. In addition, for a given feature, every tenth percentile (that is, the $10^{th},20^{th},\dots,90^{th}$ ) of the projection of gene expression onto that feature is calculated from the available data. If the difference between consecutive percentiles (that is, the bin width) is larger than 10% of the mean of the consecutive percentiles (the bin midpoint), then the bin is left in place. When the width is smaller, we randomly choose either the previous or subsequent percentile and merge the data into a larger bin, recalculating the width and midpoint. The bin-merging procedure continues until all bins’ widths are larger than 10% of their midpoints.

Precursors of biomass

We downloaded the metabolic model iJO1366 Orth2011 from E. coli and all those from S. cerevisiae considered in ref. Heavner2015 from the supplementary material provided with the associated publications.

Experimental data

For E. coli, we downloaded gene-expression data and experimental metadata from ref. Carrera2014 . A lightly edited version of supplementary table 2 from ref. Carrera2014 describing the full E. coli dataset can be found at the author’s GitHub.

For S. cerevisiae, we downloaded gene-expression data and growth-rate data from ref. Airoldi2009 , packaged as an “.RData” archive in the dataset S1 in that reference. Loading the archive into R, we used the data frames that reported gene-expression data for strains growing at a fixed rate: “frmeDataCharles” Lu2009 and “frmeDataGresham” Airoldi2016 . To these, we added data from ref. Slavov2011 (downloaded from http://genomics-pubs.princeton.edu/grr/), whose growth-rate, but not gene-expression, data are included in the R archive. These three datasets shared 5,527 unique genes and 107 total experiments.

For the purpose of estimating the correlations between genes in S. cerevisiae, we obtained data from two large-scale screens of gene knockouts Hughes2000 ; Kemmeren2014 . The 300 expression profiles of Hughes2000 were used as provided in the RData archive. Raw data from ref. Kemmeren2014 were downloaded from GEO and preprocessed as described in the supplement of that reference. Following ref. Kemmeren2014 , we excluded gene-expression profiles with fewer than four genes with significant responses to the gene deletion. These 1,369 experiments comprising 700 responsive strains were identified from supplemental table S1 of ref. Kemmeren2014 .

Determining optimal parameter values

The optimal number of neighbors was empirically determined in two ways: first by starting at the previously identified set of optimal eigengenes for $k=7$ and cross-validating the dataset with different numbers of neighbors, and second by re-running feature selection with an optimal number of neighbors obtained from the first method. The first case is illustrated by SI Appendix, Fig. S3 for a range of values for $k$ , the number of nearest neighbors. We repeated the feature selection for $k=5$ , determined the maximum for $R^{2}$ of the peak, and found that this performs less well than the $k=7$ case.

We repeated the feature selection for various values of the regularization parameter $\lambda$ , which controls the relative weighting of the prediction error and the state-space occupancy terms of Eq. (4) as shown in SI Appendix, Fig. S1. Since the forward-selection algorithm finds the best-fitting eigengene at each stage, changes to $\lambda$ and the fold divisions can lead to different selections from the available eigengenes. As before, the variability in eigengenes due to fold selection can be mitigated by repeating the selection and taking the eigenvector that performs best on average. In E. coli, the maximal $R^{2}$ achieved for $\lambda=10^{-2}$ in both the $k$ -fold and leave-one-GSE-out cross-validation strategies is greater than that achieved by the other values. Additional tests led to us selecting $\lambda=0.05$ . A similar approach was used to select $\lambda$ for the S. cerevisiae dataset.

Acknowledgements

This work was supported by NIH/NIGMS R01GM113238 and NIH/NCI 1U54CA193419. TPW also acknowledges support from NSF-GRFP fund No. DGE-0824162 as well as NIH/NIGMS 5T32GM008382.

Author contributions

T.P.W. and A.E.M. designed research; T.P.W. performed research; T.P.W. and A.E.M. analyzed data; and T.P.W. and A.E.M. wrote the paper.

Supplemental Information

Overview

The SI Results are an extended description of the parameter estimation, cross-validation, and eigengene interpretation that we performed on MI-POGUE. The SI Methods describe the implementation of the linear models, present additional context for MI-POGUE’s objective function, and provide guidance on how to select the parameter $\lambda$ .

SI Results

Alternative cross-validation strategies

In the main text, we adopt a stratified $k$ -fold cross-validation strategy. This strategy can be thought of an upper bound of MI-POGUE’s accuracy. In E. coli, we additionally investigate the possibility of overfitting using a strategy we call “leave one GSE out” in which all the gene-expression and growth-rate measurements associated with a particular Gene Expression Omnibus Barrett2012 (GEO) Series accession number (i.e., GSE) are withheld from the training set (see black dashed curves reproduced in Figs. S1, S4, S5 and S7). In S. cerevisiae the equivalent method is “leave one group out.” This strategy is the most stringent and is equivalent to applying MI-POGUE to unseen data. In this case, the peak values for $R^{2}$ are in the range of 0.45 for E. coli and 0.36 for S. cerevisiae (green curve in Fig. S6).

Specifically in S. cerevisiae, we implement two more strategies. In the first, we excluded the set of experiments with a particular growth rate (orange curve in Fig. S6). In the second, we excluded all experiments undergoing a particular treatment; for example, all strains grown in phosphate limiting conditions (red curve in Fig. S6). These two cases exhibit different behaviors as the number of features used in the regressor increases. The predictions seem to improve slightly in the case of an excluded growth rate, because the additional features aid interpolation of the growth rate. Conversely, caution is needed when extrapolating to new treatments, as adding more features in this case causes accuracy to decline due to overfitting.

Using genes instead of eigengenes

We sought to establish whether models formed with eigengenes performed better than those formed with genes by performing feature selection in both instances and comparing the resulting models. The results are shown in Fig. S7. In the stratified $k$ -fold case, gene-based models appear to outperform eigengene-based models for models with less than 10 features. However, eigenegene-based models remain preferable because they achieve a higher $R^{2}$ at large feature numbers in the stratified $k$ -fold case, they achieve a higher $R^{2}$ in the leave-one-GSE-out case, and they have a smaller number of features to search through, which reduces the computational time. We note that the selected eigengenes are not individually correlated with growth (Table S1). This is a reflection of the non-linear and non-parametric nature of KNN regression.

Noise sensitivity

The optimal features include eigenvectors associated with small eigenvalues. These small eigenvalues tend to be sensitive to the level of noise included in the features. The effect of noise can be simulated by first calculating the mean and variance of each gene across the 2,196 experiments. This mean and variance are used to define a Gaussian distribution. We use this distribution to generate “pseudo-profiles” (i.e., simulated data) and model the effect of noise by including four pseudo-profiles per experiment in the training set when testing the KNN regressor. Each pseudo-profile is assigned a growth rate that is generated by taking the actual measurement as the mean and imposing a 5% error rate about this mean. The pseudo-profiles are then projected onto the previously calculated eigengenes. The effect of noise on the eigengenes selected is illustrated in Fig. S5. As expected, the inclusion of noise shifts the eigenvalues associated with the selected eigengenes toward those that are larger in magnitude. In addition, the ability to predict growth suffers, both as more features are added in the stratified $k$ -fold case (Fig. S5A), and especially in the leave-one-GSE-out case (Fig. S5B).

Sensitivity of eigenvector selection

It is important to note that the eigenvectors selected to predict growth rate are not a unique set. The choice to include noise or use a different value for $\lambda$ lead to different eigenvectors being chosen. Furthermore, adding or subtracting experiments from the data used to calculate correlations necessarily changes the eigenvectors and eigenvalues.

Because the forward-selection algorithm finds the best-fitting eigenvector at each stage, different breakdowns of the cross-validation can change the selected eigenvector. In the version of MI-POGUE that incorporates the role of noise, fluctuations in the pseudo-profiles can likewise change the identity of the best-fitting eigenvector. Therefore, selection of the best feature must be repeated multiple times to account for the variability. Efforts to account for the uncertainty associated with each datapoint yield sets of features that perform less well in the leave-one-GSE-out case (see Fig. S5), underscoring the challenge of extending the model to predict outside data. Some of this prediction error could be mitigated with improved sampling of various stressful states.

As the predictions of growth rate are robust to the eigenvectors chosen, it appears that the eigenvectors of the gene-gene correlation matrix are related only indirectly to the biological underpinnings of the gene regulatory network. Currently, it is unclear whether there are better ways to decompose the gene expression that still produce accurate estimates of growth rate in a low-dimensional space. Furthermore, it is uncertain that more faithfully reproducing biological details will result in decompositions of gene expression that more accurately predict growth rate. The former uncertainty is well suited for additional study in machine learning, and a potential application for deep neural networks, assuming that enough data is available. The latter problem of developing biologically faithful models that still predict growth rate is one for systems biologists to systematically incorporate other sources of bioinformatic data, including the results of CBMs to improve the prediction of growth rate. In particular, incorporating other biological data sources will enable in silico prediction of the effects of genetic and environmental perturbations to the system. Nevertheless, we take a first step toward linking the eigenvectors to biological pathways in the next section.

Interpretation of eigengenes

We adapt Weighted Gene Coexpression Network Analysis (WGCNA) Langfelder2008 toward the interpretation of eigenvalues. Our strategy is to create a similarity measure based on each eigenvector, apply WGCNA on this measure, and then apply an annotation analysis method (PANTHER) on the resulting modules.

Each eigenvector, $\mathbf{p}_{l}=\mathbf{P}^{l}$ , can be transformed into a similarity measure using the outer product $\mathbf{A}^{l}=\mathbf{p}_{l}\mathbf{p}_{l}^{\intercal}$ and scaling by the reciprocal of the largest diagonal element $a_{l}=1/\max_{i}{A^{l}_{ii}}$ , yielding

[TABLE]

The similarity is a rescaled version of the eigenvector’s independent contribution to the overall correlation matrix. The rescaling is necessary to ensure that the application of WGCNA results in a connected network.

WGCNA applies soft thresholding by applying $|s_{ij}|^{\beta}$ , where the $s_{ij}$ are the elements of Eq. (S1), and $\beta=5$ is chosen such that the weighted degree distribution follows a power law. The choice of using absolute value corresponds to the “unsigned” option for constructing the weighted adjacency matrix from which the topological overlap matrix Ravasz2002 , $T$ , is calculated. Applying hierarchical clustering to $1-T$ using the “average” linkage method, we find modules—sets of genes that are more connected to one another than to the rest of the network (Table S2). The genes from each module are subjected to PANTHER Mi2017 (accessed at http://pantherdb.org/tools/uploadFiles.jsp) to find overrepresented annotations in the modules, which hints at the function of the eigengene. For each selected eigenvector, overrepresented annotations are reported in Table S3.

SI Methods

Linear models

The supplemental dataset S1 of ref. Airoldi2009 includes a function “calculateRates” in the namespace that estimates the growth rate from a gene-expression dataset based on a linear model of gene expression. The arguments of calculateRates are a gene-expression dataset, growth-rate parameters, and a calibration list of genes. The gene-expression datasets are the data frames obtained as described in the “Experimental Data” section of the Methods. The supplemental archive in ref. Airoldi2009 includes “frmeGRParameters” and “lsCalibration,” which supply the growth-rate parameters and calibration list, respectively. We call calculateRates on each of the three datasets to obtain the predicted growth rates. The real growth rates are included in the supplemental archive as “vdRealCharles,” “vdRealGresham,” and“vdRealSlavov.” We take the square of the correlation coefficient of the real rates with the calculated rates to get the value of $R^{2}$ .

Motivation for the objective function

We adapt a typical method for enforcing sparsity, known as Tikhonov regularization or ridge regression in the statistics community golub1999tikhonov , which imposes an $\ell_{2}$ regularization to select among solutions of an ill-posed least squares problem. In the context of linear least squares problems, Tikhonov regularization adds a term similar to the second Eq. (4) consisting of the squared magnitude of the linear coefficients to the least squares term (the first in Eq. (4)). In contrast to linear regression, KNN regression has no parameters to fit. In place of these parameters, we focus on the state-space occupancy of the joint distribution of discretized gene expression and growth rate (see Fig. 1B,C).

Therefore, we arrived at Eq. (4), which introduces parameters fixed by the organism ( $M$ ), the dataset ( $E$ , $N_{g}$ , $N_{\nu}$ , $g^{i}$ ) or a combination of the two ( $\vec{\nu}^{\,i}$ , $N_{p}$ ), in addition to a parameter $\lambda$ . The set, $M$ , is the set of all eigengenes. It is a pool from which subsets $S$ of fixed size $|S|$ are chosen, and the pool of genes whose expression composes the eigengenes are defined by those present in the transcriptomics chip. Each organism’s dataset has a fixed number of experiments, $E$ , and the bins of growth rate $N_{g}$ are determined by the accuracy of the measurement and the distribution of sampled growth rates, $g^{i}$ . Likewise, the measurement error and observed distribution fix the number of gene-expression bins, $N_{j}$ . Among the total number of state space bins, $N_{g}\prod_{j}^{J}N_{j}$ , $N_{p}$ are occupied by at least one experimental observation. The specific choices of the eigengenes to include in $S$ will determine the discretization of the gene-expression space, the fraction of the state space containing at least one observation, and the agreement of the growth-rate prediction $G^{(S)}(\vec{\nu}^{\,i})$ for each experiment $\vec{\nu}^{\,i}$ with the measured growth rate $g^{i}$ .

Choosing the parameter $\lambda$

In the Results, we briefly describe the limiting behavior for $\lambda$ , which is chosen to balance the least squares term (first) with the state-space occupancy term (second). Here, we go into further detail regarding how to empirically select $\lambda$ . We first note that as $\lambda\rightarrow\infty$ , the eigengenes that have a coarse-grained one-to-one correspondence (see Fig. 1B of the main text) with growth rate are selected, but as $\lambda\rightarrow 0$ the accuracy becomes the determinative factor. Therefore, there exist constants $A$ and $B$ such that for values of $\lambda>A$ the order of selected eigengenes is the same, and likewise for $\lambda<B$ .

Starting from Eq. (4) of the main text, we can bound constants $A$ and $B$ as follows in terms of $N_{g}$ , $N_{j}$ , $|S|$ , $E$ , and the expected accuracy of the growth-rate estimation, $\alpha$ . For simplicity, we approximate $N_{j}$ as a constant, as we observe in our datasets. We can rewrite the first term of Eq. (4) as $\alpha E$ . Next, we solve for the value of $\lambda$ for which the two terms are equal if $|S|=1$ and if $|S|\gg 1$ which result in the lower bound for $A$ and upper bound for $B$ , respectively. The first result is that $A>(\alpha E)/\left(\log\left(\frac{E^{2}}{N_{g}N_{j}}\right)\right)^{2}$ , and the second is that $B<(\alpha E)/(|S|\log N_{j})^{2}$ . Plugging in the values for the E. coli dataset ( $E=589$ , $N_{g}=16$ , $N_{j}=10$ ), and assuming $A>1$ in the first case and, letting $|S|=50$ , $B<0.005$ . The range tested in Fig. S1 extends slightly beyond [0.005,1], with the best fit value lying near the geometric mean.

Bibliography44

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1) Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabási AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555.
2(2) Glass K, Ott E, Losert W, Girvan M (2012) Implications of functional similarity for gene regulatory interactions. J. R. Soc. Interface 9(72):1625–1636.
3(3) Ku WL, Duggal G, Li Y, Girvan M, Ott E (2012) Interpreting patterns of gene expression: Signatures of coregulation, the data processing inequality, and triplet motifs. PLOS ONE 7(2):e 31969.
4(4) Zañudo JGT, Albert R (2015) Cell fate reprogramming by control of intracellular network dynamics. PLOS Comput. Biol. 11(4):e 1004193.
5(5) Cornelius SP, Kath WL, Motter AE (2013) Realistic control of network dynamics. Nat. Commun.
6(6) Edwards JS, Palsson BØ (2000) The Escherichia coli MG 1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97(10):5528–5533.
7(7) Förster J, Famili I, Fu P, Palsson BØ, Nielsen J (2003) Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 13(2):244–253.
8(8) Segrè D, Vitkup D, Church GM (2002) Analysis of optimality in natural and perturbed metabolic networks. Proc. Natl. Acad. Sci. USA 99(23):15112–15117.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Code & Models

Videos

Predicting Growth Rate from Gene Expression

Abstract

keywords:

Significance summary

Introduction

Results

Dataset overview

Eigengene estimation

Restriction of KNN models to the most informative eigengenes

Optimization of KNN regressors

Prediction comparison with existing methods

Discussion

Methods

Implementation of MI-POGUE

Discretization of data

Precursors of biomass

Experimental data

Determining optimal parameter values

Acknowledgements

Author contributions

Supplemental Information

Overview

SI Results

Alternative cross-validation strategies

Using genes instead of eigengenes

Noise sensitivity

Sensitivity of eigenvector selection

Interpretation of eigengenes

SI Methods

Linear models

Motivation for the objective function

Choosing the parameter λ\lambdaλ

Choosing the parameter $\lambda$