TL;DR
This study develops a machine learning model that predicts microbial growth rates from gene expression data, achieving high accuracy and reducing feature complexity, which can guide experimental optimization.
Contribution
It introduces a novel approach using k-nearest-neighbors regression to accurately predict growth rates from gene expression with significant feature reduction.
Findings
Predicts E. coli growth rate with 81% variance explained
Predicts S. cerevisiae growth rate with 89% variance explained
Reduces gene expression features from thousands to fewer than twenty
Abstract
Growth rate is one of the most important and most complex phenotypic characteristics of unicellular microorganisms, which determines the genetic mutations that dominate at the population level, and ultimately whether the population will survive. Translating changes at the genetic level to their growth rate consequences remains a subject of intense interest, since such a mapping could rationally direct experiments to optimize antibiotic efficacy or bioreactor productivity. In this paper, we directly map transcriptional profiles to growth rates by gathering published gene-expression data from Escherichia coli and Saccharomyces cerevisiae with corresponding growth-rate measurements. Using a machine-learning technique called k-nearest-neighbors regression, we build a model which predicts growth rate from gene expression. By exploiting the correlated nature of gene expression and sparsifyingā¦
| Selected Eigenvalue | Rank | Spearman Correlation |
|---|---|---|
| 0.053516 | 1829 | -0.040 |
| 0.052508 | 990 | 0.157 |
| 0.060603 | 310 | -0.264 |
| 0.060767 | 2125 | 0.002 |
| 0.159468 | 1994 | -0.019 |
| 0.077524 | 1287 | -0.113 |
| 0.095336 | 1355 | 0.105 |
| 0.098494 | 768 | -0.188 |
| 0.079097 | 542 | -0.222 |
| (Including Noise) | ||
| 32.221929 | 50 | 0.354 |
| Ā Ā Ā Ā 3.039641 | 1896 | -0.032 |
| Ā Ā Ā Ā 3.114244 | 1519 | 0.083 |
| Ā Ā Ā Ā 1.488373 | 1081 | -0.143 |
| Ā Ā Ā Ā 3.386905 | 1963 | 0.023 |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Predicting Growth Rate from Gene Expression
Thomas P. Wytock
Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, USA
āā
Adilson E. Motter
Department of Physics and Astronomy, Northwestern University, Evanston, IL 60208, USA
Northwestern Institute on Complex Systems, Evanston, IL 60208, USA
Physical Sciences-Oncology Center, Northwestern University, Evanston, Illinois 60208, USA
Abstract
Growth rate is one of the most important and most complex phenotypic characteristics of unicellular microorganisms, which determines the genetic mutations that dominate at the population level, and ultimately whether the population will survive. Translating changes at the genetic level to their growth rate consequences remains a subject of intense interest, since such a mapping could rationally direct experiments to optimize antibiotic efficacy or bioreactor productivity. In this paper, we directly map transcriptional profiles to growth rates by gathering published gene-expression data from Escherichia coli and Saccharomyces cerevisiae with corresponding growth-rate measurements. Using a machine-learning technique called -nearest-neighbors regression, we build a model which predicts growth rate from gene expression. By exploiting the correlated nature of gene expression and sparsifying the model, we capture 81% of the variance in growth rate of the E. coli dataset while reducing the number of features from over 4,000 to nine. In S. cerevisiae, we account for 89% of the variance in growth rate while reducing from over 5,500 dimensions to 18. Such a model provides a basis for selecting successful strategies from among the combinatorial number of experimental possibilities when attempting to optimize complex phenotypic traits like growth rate.
Wytock, T. P., & Motter, A. E. (2019). Predicting growth rate from gene expression. Proc. Natl. Acad. Sci. USA, 116(2), 367ā372
keywords:
biological networks machine learning systems biology metabolic networks data science
Significance summary
Connecting genetic changes to organismal function has been a central problem of biology for decades. Understanding the genetic underpinnings of functional traits like growth rate remains incomplete despite efforts to uncover metabolic and gene regulatory networks. Here, we leverage correlations derived from large-scale datasets of E. coli and S. cerevisiae to construct a mapping between gene expression and growth using the -nearest neighbors technique. Our mapping can predict growth rate more accurately than previous methods, while compressing gene-expression data from thousands of genes to tens of features without requiring network structure identification. This model can be applied to generate hypotheses, design experiments, and reduce the amount of trial and error in research.
Introduction
Mapping genotype to phenotype remains a central challenge in molecular biology. In the past two decades, complex networks have emerged as a tool to organize the vast amount of data generated by genomic technologies toward mapping biochemical patterns to whole-system function. Applications include the meta-analysis of genetic interactions across organisms to find biologically conserved structuresĀ Ravasz2002 , the network analysis of gene annotation relationships to interpret the expression changes in gene setsĀ Glass2012 ; Ku2012 , the curation of genetic relationships into gene-regulatory networks to predict the outcome of proposed interventionsĀ Zanudo2015 ; Cornelius2013 , and the reconstruction of metabolic networks on which metabolic capacity is calculated through constraint-based models (CBMs)Ā Edwards2000 ; Forster2003a ; Segre2002 . These tools have also been used to explain gene essentiality and epistasisĀ Forster2003b ; Gerdes2003 ; Segre2005a , and to study disease progression and treatmentĀ Zanudo2015 ; Schlauch2017 .
Though successful, these strategies all rely on the labor-intensive task of determining the network of relationships between genes. They resolve the network structure through a combination of aggregation of prior knowledge and targeted experimentation, but the benefit that network models provide by structuring data are limited by the problems they address or the conditions under which they apply. For example, annotation methods are associative and qualitative, which limit the potential for causal attribution and inter-study comparison, respectively. Meanwhile, precise dynamic models of gene-regulatory networks often require the (challenging) measurement of in vivo kinetic parameters or other condition-specific quantities to validate the dynamical rules. On the other hand, CBMs require experiments with well-defined media and measurements of metabolic uptake rates for flux-balance analysis (FBA) to yield accurate maximal rates of biomass production. In addition, FBA assumes that the cell directs its metabolic activities to maximize cell growth and is fully adapted to its environment both before and after a perturbationĀ Edwards2000 ; Fong2004 , although alternate methods have been developed that relax this restrictive requirement on the final stateĀ Segre2002 ; Shlomi2005 or more generallyĀ Mahadevan2003 .
In this paper, we establish a complementary method to predict growth rate using only gene-expression data, which we refer to as Model-Independent Prediction Of Growth Using Expression (MI-POGUE). Even though our method focuses on growth rate, it provides a novel strategy to answer the more general questions of how whole-cell gene expression affects phenotypic changes and thus of how to convert genome-wide observations into quantitative phenotypic predictions. The novelty and flexibility of MI-POGUE derive from using an effective model of genetic interactions in lieu of relying on prior knowledge or specialized experiments.
We develop MI-POGUE by retrieving large datasets of gene expression and growth rate in E. coliĀ Carrera2014 and S. cerevisiaeĀ Hughes2000 ; Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 ; Kemmeren2014 . Comprising thousands of individual observations, these datasets allow the direct measurement of gene-gene correlations present in cells, which form the basis of an effective model of genetic regulation. In our approach, we transform the gene-expression data into weighted combinations of genes derived from the gene-gene correlations called āeigengenesāĀ Alter2000 , and predict growth rate by averaging the growth rates associated with the gene-expression profiles most similar to a given target profileāa technique known as -nearest neighborsĀ (KNN) regression Altman1992 . The efficacy of MI-POGUE is substantiated by comparing it with state-of-the-art methods for predicting growth rate.
The data-driven conception of MI-POGUE sidesteps the network identification problem while still accounting for all observed changes to intracellular networks in response to perturbations. Given that nonlinearity allows small changes in part of the cell to effect large changes in another part, broad-based strategies like MI-POGUE that account for changes across the whole genome promise to open new lines of inquiry in investigating fundamental systems biology, as well as in engineering microorganisms, and designing antibiotics.
Results
Dataset overview
MI-POGUE requires a number of paired gene-expression and growth-rate measurements large enough to form a representative sample of potential organismal growth conditions to provide accurate growth estimates. We apply MI-POGUE to both E. coli and S. cerevisiae. TableĀ 1 establishes that the size of the dataset we consider is unusually large for each organism.
The E. coli dataset is derived from ref.Ā Carrera2014 and includes a broad sample of environmental conditions, measuring the effects of heat shock, hypoxia, or adaptive evolution on a variety of carbon sources in addition to over 150 genetic perturbations. The primary substrains of E. coli K12 featured in these experiments are MG1655 and BW25113. The S. cerevisiae dataset, which serves to demonstrate its applicability to eukaryotes, comprises experiments performed in chemostats with various environmental stresses and nutrient limitations with gene-expression data taken from refs.Ā Hughes2000 ; Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 ; Kemmeren2014 and growth rate taken from refs.Ā Airoldi2009 ; Airoldi2016 ; Slavov2011 ; Lu2009 . The metadata annotating the experiments is curated from these references, as described in the Methods, and they are provided with MI-POGUEās source codeĀ Wytock2018 .
Eigengene estimation
The expression between genes is highly correlatedĀ Alter2000 , implying that each geneās expression depends on its neighbors in the network. Here, we derive eigengenes, which are combinations of genes that reorganize expression according to interdependencies implicitly mediated by the gene regulatory network. Let and be the growth rate and gene expression, respectively, of the experiment, and let be an index over genes, and let and be the set of all growth-rate measurements and their associated gene-expression profiles, respectively. Furthermore, let be the total number of experiments including a growth-rate measurement, be the number of gene-expression measurements used to estimate correlations, and be the total number of genes common to each measurement.
We estimate the correlations between genes based on all the available expression data (regardless of whether it had associated growth rate or not) and calculate the eigenvectors. Briefly, we compute the gene-gene (Pearson) correlation matrix where is the Pearson correlation coefficient between the expression of the and genes. The correlation matrix is square ( by ) and symmetric under exchange of indices. We diagonalize the matrix,
[TABLE]
resulting in the matrix in which each column corresponds to the eigenvector while each row reflects the geneās projection onto the set of eigenvectors. The diagonal matrix indicates the amount of correlation occurring along the column of . Because the correlation matrix is symmetric, we have
[TABLE]
Any expression profile may be projected onto the correlation eigenvectors by matrix multiplication:
[TABLE]
We call the eigengenes, a portmanteau of āeigenā (proper) and āgeneāĀ Alter2000 , because they represent independent (that is, non-redundant) variations in gene-expression space. Therefore, increasing or decreasing the magnitude of one eigengeneās expression (), leaves the other projections unchanged (). In contrast, changing the expression of the gene, , would result in changes in other genes, modulated by , allowing this change to have wide-ranging impacts across the gene-expression profile.
Restriction of KNN models to the most informative eigengenes
The method of KNN regression is a machine-learning technique trained on a set of paired measurements of independent and dependent variables that assigns an average of selected dependent variables to a test measurement of independent variables, where the dependent variables are selected by testing independent variablesā similarity with the training measurements. Here, the dependent variable is growth rate and the independent variables are the elements of a gene-expression profile as illustrated in Fig.Ā 1A. The output of the KNN-fitting process is called a regressor, which takes gene expression as input and outputs an estimate of growth rate. From the set of all calculated eigengenes, we restrict to those with greatest potential to inform growth rate by searching for eigengenes that vary (possibly non-linearly) the most as growth rate changes. We discretize each measurement into bins of both growth rate and gene expression, and search for eigengenes that most evenly distribute the experiments into bins and thus span the range of variation observed.
FigureĀ 1B illustrates in grids the joint distributions between growth rate and gene expression for two eigengenes. In this example, we suppose that each eigengene (and growth rate) can be in one of three states and discretize expression into these bins. Bins with higher densities of observations have darker colors. Any single eigengene places limited constraints on the possible values of growth rate, but by adding more eigengenes the growth-rate possibilities for a given eigengene-expression profile narrow. The increased specificity comes at the cost of increasing the number of possible states an experiment could occupy, thereby increasing the sensitivity to noise (as diagrammed in Fig.Ā 1C). Models that incorporate enough eigengenes to estimate growth rate but avoid overfitting maximize their ability to accurately explain the observed data and while retaining the ability to predict new data.
Optimization of KNN regressors
To select features that predict growth, we require an objective function that balances the explanatory and predictive capabilities of the regressor, called . We first quantify the explanatory capabilities of , which is characterized by the set of features that define the gene-expression subspace in which neighboring experiments are determined. The argument of is an experiment, , where as before is an index over experiments. We then determine the accuracy of using the squared difference with the experimentally measured growth rate .
As the number of eigengenes in increases, the predictions converge toward the measured growth rate, but the rate of convergence slows as models incorporate eigengenes. Therefore, we introduce a criterion to quantify how efficient is in terms of state space. This term measures whether the decrease in error is large enough to justify the addition of another eigengene. In a maximally efficient model, each unique combination of eigengene-expression levels would have a corresponding range of growth rates, with no expression combination excluded. In other words, the number of bins occupied by the experiments in the dataset () would be equal to the number of growth rate bins in Fig.Ā 1B (). At the same time, would also be equal to the total possible number of configurations , where is the number of bins for the eigengene. We take the square of the (natural) logarithm of each ratio to obtain , which we refer to as the state-space occupancy. Finally, we introduce the regularization parameter to balance the relative contribution of the explanatory and predictive terms, yielding:
[TABLE]
The value of at the optimal value of is case-dependent and empirically found to be near for E. coli (Methods andĀ Fig.Ā S1) and for S. cerevisiae. Asymptotically, the optimal shifts toward smaller numbers as and toward larger numbers as .
We compare the performance of the various with Eq.Ā (4) by dividing the dataset into training and test data consisting of gene-expression profiles paired with growth rate. We choose to employ āstratified, five-fold cross-validationā which divides the existing data into subsets, called āfolds,ā whose distribution of growth rates is constrained to match the distribution of the entire dataset as closely as possible. In testing the generalizability of the regressor, the dataset comprises five equally sized folds, and four folds are used as training data to fit the regressor, which is tested on the fifth. Cross-validation is repeated with each fold used as test data once, yielding predictions for each of the experiments in the dataset. To account for variability in predictions due to fold construction, we average the predictions over 100 divisions of the dataset.
Finding the optimal set of eigengenes requires the testing of each possible set , the number of which grows combinatorially. In view of the huge number of possibilities, we employ the āforward-selectionā heuristic, which builds the set by adding eigengenes one at a time to find a set that is close to optimalĀ efron2004least . Starting with , and continuing for each size of , we rank all candidate regressors by the cross-validation procedure described in the previous paragraph. Next, we take the top-ranked set of eigengenes and form candidate sets of features of size by adding each of the remaining features to the top-ranked set of eigengenes. The cross-validation process continues until , at which point the root mean squared error (RMSE) of the predictions has stopped improving.
With the process for evaluating regressors in mind, we detail the optimal regressors for each value of in Fig.Ā 2. Strikingly, the models including 9 and 18 eigengenes out of thousands achieve a better balance of accuracy and predictability than by including much larger gene-based models in both E. coli and S. cerevisiae (TableĀ 2). The small number of features at the minimum compresses the gene-expression information relevant to growth rate into a relatively low-dimensional subspace thereby facilitating further analysis.
For the best regressor composed of each number of eigengenes, we examine the trend in the square root of the sum of squared errors (SSE) (Fig.Ā S2A)) and the state-space occupancy (Fig.Ā S2B) corresponding to the first and second terms of Eq.Ā (4), respectively. The SSE trend in Fig.Ā S2A shows the stagnating improvements in accuracy, despite the geometric decrease in the state-space occupancy by observations in Fig.Ā S2B. The large unoccupied fraction of state space hampers predictability as it is unclear how to extrapolate to hypothetical observations in this region.
Prediction comparison with existing methods
In TableĀ 2, we compare the quality of predictions based on eigengenes with those based on the precursors of biomassāthat is, all the genes included in the metabolic reconstruction for each organism. We obtained the precursors of biomass from iJO1366Ā Orth2011 , containing 1,352 genes for E. coli, and from YeastĀ 7Ā Aung2013 , containing 897 genes for S. cerevisiae, and applied MI-POGUE to predict growth rate based on these genesā expression only. For both organisms, models built on eigengenes have a higher coefficient of determination () and lower RMSE than those built on precursors of biomass, despite requiring fewer features.
Given the overall growth-rate prediction accuracy in both organisms, we investigate the accuracy at the level of the individual experiments in Fig.Ā 3A. MI-POGUE performs comparably across most of the strains present in our dataset. In E. coli, 348 of 589 of the predicted experiments fall inside the 5% error (grey region), despite systematic underestimation of the fastest and overestimation of the slowest growth rates. S. cerevisiae shows a similar pattern, with a slightly higher fraction of experiments with less than 5% error (66 of 107). The KNN approach systematically overestimates the slowest growth rates and underestimates the fastest growth rates because, by construction, the nearest neighbors of the slowest growth state will have faster-growth neighbors and vice-versa.
Nevertheless, the level of accuracy motivated us to compare MI-POGUE with other existing methods using as a metric in TableĀ 3. MI-POGUE achieves agreement with experimental measurements superior to other reported methods without need for the additional step of converting the value of the biomass objective function to growth rate. MI-POGUEās accuracy outpaces that of CBMs without requiring metabolic uptake rates or enzyme kinetic parameters.
We demonstrate the flexibility of our method by applying it to predict the growth rate of S. cerevisiae grown in chemostats (Fig.Ā 3B). We also predicted growth using the linear models from ref.Ā Airoldi2009 for comparison (SIĀ Methods). With , MI-POGUE far outpaces the moderate success of the linear growth model for S. cerevisiaeĀ Airoldi2009 , which has . Compared to E. coli, MI-POGUE performs slightly better in S. cerevisiae, despite the additional complexity of mapping gene expression to functionĀ Ku2012 . This improvement in performance may be attributed to the relative sizes of the datasets and diversity of the conditions considered.
The success of MI-POGUE compared to the other methods derives, in part, from using eigengenes instead of single genes, but at first one might think that this choice obscures the biological role of eigengenes. We note, however, that Weighted Gene Coexpression Network Analysis (WGCNA)Ā Langfelder2008 has recently been employed in fungi to associate gene modules with qualitative growth statesĀ Baltussen2018 . Our approach is quantitative rather than qualitative but we can borrow this tool to interpret our eigengenes. Specifically, we develop a method to associate biological functions with eigengenes in E. coli using WGCNA and Protein Analysis Through Evolutionary Relationships (PANTHER)Ā Mi2017 . For each selected eigengene, we take the outer product of the eigengene with itself, resulting in a by similarity matrix that is rescaled such that the largest diagonal element is one (see Methods for full details). The rescaled matrix is subjected to WGCNA, yielding a module of genes associated with each eigenvector. We use PANTHER to identify the most over- and underrepresented Gene Ontology (GO) Biological Process annotations in each module (defined by , Fisherās Exact Test).
We find that the top GO terms associated with modules derived from five of the nine eigengenes are overrepresented for polysaccharide, phospholipid, lipid, fatty acid, and amino acid metabolism. In two others, DNA repair and DNA metabolism are overrepresented. Of the remaining two, one is defined by its underrepresentation of metabolic genes, while the other has no terms meeting the -value thresholdāreflecting the sometimes weak association between eigengenes and GO terms. That the nine selected eigengenes have largely nonoverlapping annotations is a result of the forward-selection process. Once an eigengene that captures one biological process is selected, it is less likely that a second eigengene capturing the same process will be selected. The full gene lists and annotation lists are reported in TablesĀ S2Ā andĀ S3.
Discussion
MI-POGUE both addresses the challenges faced by previous methods and reduces the labor necessary to construct models that map gene expression to phenotype. It fully incorporates gene expression, relaxes the requirement that organisms be completely adapted to their environment, reduces reliance on metabolic uptake rates, and avoids the necessity of estimating enzyme kinetic parametersĀ Teusink2000 ; VanEunen2012 ; Chubukov2014 . Furthermore, MI-POGUE can be applied broadly, even in cases where the growth media are not strictly defined. Because MI-POGUE is flexible, it can repurpose previous measurements without requiring extensive targeted experiments to determine the structure of intracellular networks. Its ability to reduce the relevant features to a small number of eigengenes allows for genome-wide data to be expressed succinctly without loss of predictive power. These advantages are achieved while simultaneously improving the capacity to predict growth rate.
MI-POGUE can characterize the independence, synergy, or antagonism of perturbation pairs by evaluating the growth rate of a hypothetical transcriptional state constructed by adding the (experimentally measured) transcriptional responses of two perturbations to a reference state. Whereas local models leave open the possibility that some unobserved gene accounts for growth deviations from independence, our method implies that, within limitations of the available data, such deviations result from non-genetic mechanismsĀ PenalverBernabe2016 .
The approach of generating proposed states based on experimentally measured gene-expression responses to perturbations can also pre-screen experimental hypotheses. Such screening has the advantage of accounting for the real response of cells as opposed to the simulated response based on network structureĀ Alter2000 ; Brauer2008 . In the case of two perturbations where both are genetic, MI-POGUE can be used to estimate the growth of expression profiles resulting from both individual perturbations and the double perturbation, yielding a computational prediction of growth-rate epistasis, thereby providing a new tool to understand itĀ Segre2005a . MI-POGUE could also be integrated with transcriptional regulatory network from databasesĀ Gama-Castro2015 to predict the impact of previously unmeasured gene perturbations.
Researchers can incorporate MI-POGUE with existing strategies that interpret genetic networks in order to improve their effectiveness, as we demonstrate by using WGCNA and PANTHER to interpret the biological roles of eigengenes. Genomic footprintingĀ Gerdes2003 , previously used to find essential genes, could be used to resolve regions of gene expression that yield zero growth, which can enhance the ability of the KNN algorithm to extrapolate beyond the training data.
The applicability of MI-POGUE to metabolic engineering, antibiotic development, and systems biology is expected to encourage its adoption and further refinement or the adoption of similar methods. For example, metabolic engineers could tailor MI-POGUE to offer predictions of a key uptake or secretion rate based on the organismās gene expression. Antibiotic developers could use transcriptional changes in response to drugs to choose combinations that result in the slowest growth rate as predicted by MI-POGUE. Systems biologists could use MI-POGUE to look for interactions between genes by taking transcriptional responses to single knockouts, adding them, and simulating the outcome. We also note that the mapping of eigengenes to biological functions as described here merit further investigation. Using modern community-detection algorithms, such as weighted stochastic block modelsĀ Aicher2014 , can help discern finer-scale structure of gene modules
Given that cells are complex systems, weak and indirect interactions at the molecular level can influence behavior at the whole-cell level. Reductionist strategies are poorly suited to study these phenomena, but approaches like MI-POGUE, which combine machine learning with bioinformatic ābig data,ā have the potential to capture these subtle effects. As systems biologists adapt machine-learning techniques to better interpret high-throughput data, the new interpretative power of these techniques has the potential to reveal under-appreciated and sometimes counter-intuitive effects that will drive the field into the future.
Methods
Implementation of MI-POGUE
We used an implementation of the nearest-neighbors algorithmĀ Altman1992 found in the Python sklearn packageĀ Pedregosa2011 . For E. coli (S. cerevisiae) models, we chose () to be the number of neighbors as this number was shown to perform better than other choices (SIĀ Appendix, Figs.Ā S3Ā andĀ S4). Additional details about the development and extensions of MI-POGUE are described in the SIĀ Appendix, with the code and instructions for running MI-POGUE is available in ref.Ā Wytock2018 .
Discretization of data
The growth-rate bins are fixed so that the number of experiments in each bin is approximately the same. In addition, for a given feature, every tenth percentile (that is, the ) of the projection of gene expression onto that feature is calculated from the available data. If the difference between consecutive percentiles (that is, the bin width) is larger than 10% of the mean of the consecutive percentiles (the bin midpoint), then the bin is left in place. When the width is smaller, we randomly choose either the previous or subsequent percentile and merge the data into a larger bin, recalculating the width and midpoint. The bin-merging procedure continues until all binsā widths are larger than 10% of their midpoints.
Precursors of biomass
We downloaded the metabolic model iJO1366Ā Orth2011 from E. coli and all those from S. cerevisiae considered in ref.Ā Heavner2015 from the supplementary material provided with the associated publications.
Experimental data
For E. coli, we downloaded gene-expression data and experimental metadata from ref.Ā Carrera2014 . A lightly edited version of supplementaryĀ tableĀ 2 from ref.Ā Carrera2014 describing the full E. coli dataset can be found at the authorās GitHub.
For S. cerevisiae, we downloaded gene-expression data and growth-rate data from ref.Ā Airoldi2009 , packaged as an ā.RDataā archive in the datasetĀ S1 in that reference. Loading the archive into R, we used the data frames that reported gene-expression data for strains growing at a fixed rate: āfrmeDataCharlesāĀ Lu2009 and āfrmeDataGreshamāĀ Airoldi2016 . To these, we added data from ref.Ā Slavov2011 (downloaded from http://genomics-pubs.princeton.edu/grr/), whose growth-rate, but not gene-expression, data are included in the R archive. These three datasets shared 5,527 unique genes and 107 total experiments.
For the purpose of estimating the correlations between genes in S. cerevisiae, we obtained data from two large-scale screens of gene knockoutsĀ Hughes2000 ; Kemmeren2014 . The 300 expression profiles ofĀ Hughes2000 were used as provided in the RData archive. Raw data from ref.Ā Kemmeren2014 were downloaded from GEO and preprocessed as described in the supplement of that reference. Following ref.Ā Kemmeren2014 , we excluded gene-expression profiles with fewer than four genes with significant responses to the gene deletion. These 1,369 experiments comprising 700 responsive strains were identified from supplementalĀ tableĀ S1 of ref.Ā Kemmeren2014 .
Determining optimal parameter values
The optimal number of neighbors was empirically determined in two ways: first by starting at the previously identified set of optimal eigengenes for and cross-validating the dataset with different numbers of neighbors, and second by re-running feature selection with an optimal number of neighbors obtained from the first method. The first case is illustrated by SIĀ Appendix, Fig.Ā S3 for a range of values for , the number of nearest neighbors. We repeated the feature selection for , determined the maximum for of the peak, and found that this performs less well than the case.
We repeated the feature selection for various values of the regularization parameter , which controls the relative weighting of the prediction error and the state-space occupancy terms of Eq.Ā (4) as shown in SIĀ Appendix, Fig.Ā S1. Since the forward-selection algorithm finds the best-fitting eigengene at each stage, changes to and the fold divisions can lead to different selections from the available eigengenes. As before, the variability in eigengenes due to fold selection can be mitigated by repeating the selection and taking the eigenvector that performs best on average. In E. coli, the maximal achieved for in both the -fold and leave-one-GSE-out cross-validation strategies is greater than that achieved by the other values. Additional tests led to us selecting . A similar approach was used to select for the S. cerevisiae dataset.
Acknowledgements
This work was supported by NIH/NIGMS R01GM113238 and NIH/NCI 1U54CA193419. TPW also acknowledges support from NSF-GRFP fund No. DGE-0824162 as well as NIH/NIGMS 5T32GM008382.
Author contributions
T.P.W. and A.E.M. designed research; T.P.W. performed research; T.P.W. and A.E.M. analyzed data; and T.P.W. and A.E.M. wrote the paper.
Supplemental Information
Overview
The SI Results are an extended description of the parameter estimation, cross-validation, and eigengene interpretation that we performed on MI-POGUE. The SI Methods describe the implementation of the linear models, present additional context for MI-POGUEās objective function, and provide guidance on how to select the parameter .
SI Results
Alternative cross-validation strategies
In the main text, we adopt a stratified -fold cross-validation strategy. This strategy can be thought of an upper bound of MI-POGUEās accuracy. In E. coli, we additionally investigate the possibility of overfitting using a strategy we call āleave one GSE outā in which all the gene-expression and growth-rate measurements associated with a particular Gene Expression OmnibusĀ Barrett2012 (GEO) Series accession number (i.e., GSE) are withheld from the training set (see black dashed curves reproduced in Figs.Ā S1, S4, S5 andĀ S7). In S. cerevisiae the equivalent method is āleave one group out.ā This strategy is the most stringent and is equivalent to applying MI-POGUE to unseen data. In this case, the peak values for are in the range of 0.45 for E. coli and 0.36 for S. cerevisiae (green curve in Fig.Ā S6).
Specifically in S. cerevisiae, we implement two more strategies. In the first, we excluded the set of experiments with a particular growth rate (orange curve in Fig.Ā S6). In the second, we excluded all experiments undergoing a particular treatment; for example, all strains grown in phosphate limiting conditions (red curve in Fig.Ā S6). These two cases exhibit different behaviors as the number of features used in the regressor increases. The predictions seem to improve slightly in the case of an excluded growth rate, because the additional features aid interpolation of the growth rate. Conversely, caution is needed when extrapolating to new treatments, as adding more features in this case causes accuracy to decline due to overfitting.
Using genes instead of eigengenes
We sought to establish whether models formed with eigengenes performed better than those formed with genes by performing feature selection in both instances and comparing the resulting models. The results are shown in Fig.Ā S7. In the stratified -fold case, gene-based models appear to outperform eigengene-based models for models with less than 10 features. However, eigenegene-based models remain preferable because they achieve a higher at large feature numbers in the stratified -fold case, they achieve a higher in the leave-one-GSE-out case, and they have a smaller number of features to search through, which reduces the computational time. We note that the selected eigengenes are not individually correlated with growth (TableĀ S1). This is a reflection of the non-linear and non-parametric nature of KNN regression.
Noise sensitivity
The optimal features include eigenvectors associated with small eigenvalues. These small eigenvalues tend to be sensitive to the level of noise included in the features. The effect of noise can be simulated by first calculating the mean and variance of each gene across the 2,196 experiments. This mean and variance are used to define a Gaussian distribution. We use this distribution to generate āpseudo-profilesā (i.e., simulated data) and model the effect of noise by including four pseudo-profiles per experiment in the training set when testing the KNN regressor. Each pseudo-profile is assigned a growth rate that is generated by taking the actual measurement as the mean and imposing a 5% error rate about this mean. The pseudo-profiles are then projected onto the previously calculated eigengenes. The effect of noise on the eigengenes selected is illustrated in Fig.Ā S5. As expected, the inclusion of noise shifts the eigenvalues associated with the selected eigengenes toward those that are larger in magnitude. In addition, the ability to predict growth suffers, both as more features are added in the stratified -fold case (Fig.Ā S5A), and especially in the leave-one-GSE-out case (Fig.Ā S5B).
Sensitivity of eigenvector selection
It is important to note that the eigenvectors selected to predict growth rate are not a unique set. The choice to include noise or use a different value for lead to different eigenvectors being chosen. Furthermore, adding or subtracting experiments from the data used to calculate correlations necessarily changes the eigenvectors and eigenvalues.
Because the forward-selection algorithm finds the best-fitting eigenvector at each stage, different breakdowns of the cross-validation can change the selected eigenvector. In the version of MI-POGUE that incorporates the role of noise, fluctuations in the pseudo-profiles can likewise change the identity of the best-fitting eigenvector. Therefore, selection of the best feature must be repeated multiple times to account for the variability. Efforts to account for the uncertainty associated with each datapoint yield sets of features that perform less well in the leave-one-GSE-out case (see Fig.Ā S5), underscoring the challenge of extending the model to predict outside data. Some of this prediction error could be mitigated with improved sampling of various stressful states.
As the predictions of growth rate are robust to the eigenvectors chosen, it appears that the eigenvectors of the gene-gene correlation matrix are related only indirectly to the biological underpinnings of the gene regulatory network. Currently, it is unclear whether there are better ways to decompose the gene expression that still produce accurate estimates of growth rate in a low-dimensional space. Furthermore, it is uncertain that more faithfully reproducing biological details will result in decompositions of gene expression that more accurately predict growth rate. The former uncertainty is well suited for additional study in machine learning, and a potential application for deep neural networks, assuming that enough data is available. The latter problem of developing biologically faithful models that still predict growth rate is one for systems biologists to systematically incorporate other sources of bioinformatic data, including the results of CBMs to improve the prediction of growth rate. In particular, incorporating other biological data sources will enable in silico prediction of the effects of genetic and environmental perturbations to the system. Nevertheless, we take a first step toward linking the eigenvectors to biological pathways in the next section.
Interpretation of eigengenes
We adapt Weighted Gene Coexpression Network Analysis (WGCNA)Ā Langfelder2008 toward the interpretation of eigenvalues. Our strategy is to create a similarity measure based on each eigenvector, apply WGCNA on this measure, and then apply an annotation analysis method (PANTHER) on the resulting modules.
Each eigenvector, , can be transformed into a similarity measure using the outer product and scaling by the reciprocal of the largest diagonal element , yielding
[TABLE]
The similarity is a rescaled version of the eigenvectorās independent contribution to the overall correlation matrix. The rescaling is necessary to ensure that the application of WGCNA results in a connected network.
WGCNA applies soft thresholding by applying , where the are the elements of Eq.Ā (S1), and is chosen such that the weighted degree distribution follows aĀ power law. The choice of using absolute value corresponds to the āunsignedā option for constructing the weighted adjacency matrix from which the topological overlap matrixĀ Ravasz2002 , , is calculated. Applying hierarchical clustering to using the āaverageā linkage method, we find modulesāsets of genes that are more connected to one another than to the rest of the network (TableĀ S2). The genes from each module are subjected to PANTHERĀ Mi2017 (accessed at http://pantherdb.org/tools/uploadFiles.jsp) to find overrepresented annotations in the modules, which hints at the function of the eigengene. For each selected eigenvector, overrepresented annotations are reported inĀ TableĀ S3.
SI Methods
Linear models
The supplementalĀ datasetĀ S1 of ref.Ā Airoldi2009 includes a function ācalculateRatesā in the namespace that estimates the growth rate from a gene-expression dataset based on a linear model of gene expression. The arguments of calculateRates are a gene-expression dataset, growth-rate parameters, and a calibration list of genes. The gene-expression datasets are the data frames obtained as described in the āExperimental Dataā section of the Methods. The supplemental archive in ref.Ā Airoldi2009 includes āfrmeGRParametersā and ālsCalibration,ā which supply the growth-rate parameters and calibration list, respectively. We call calculateRates on each of the three datasets to obtain the predicted growth rates. The real growth rates are included in the supplemental archive as āvdRealCharles,ā āvdRealGresham,ā andāvdRealSlavov.ā We take the square of the correlation coefficient of the real rates with the calculated rates to get the value of .
Motivation for the objective function
We adapt a typical method for enforcing sparsity, known as Tikhonov regularization or ridge regression in the statistics communityĀ golub1999tikhonov , which imposes an regularization to select among solutions of an ill-posed least squares problem. In the context of linear least squares problems, Tikhonov regularization adds a term similar to the second Eq.Ā (4) consisting of the squared magnitude of the linear coefficients to the least squares term (the first in Eq.Ā (4)). In contrast to linear regression, KNN regression has no parameters to fit. In place of these parameters, we focus on the state-space occupancy of the joint distribution of discretized gene expression and growth rate (see Fig.Ā 1B,C).
Therefore, we arrived at Eq.Ā (4), which introduces parameters fixed by the organism (), the dataset (, , , ) or a combination of the two (, ), in addition to a parameter . The set, , is the set of all eigengenes. It is a pool from which subsets of fixed size are chosen, and the pool of genes whose expression composes the eigengenes are defined by those present in the transcriptomics chip. Each organismās dataset has a fixed number of experiments, , and the bins of growth rate are determined by the accuracy of the measurement and the distribution of sampled growth rates, . Likewise, the measurement error and observed distribution fix the number of gene-expression bins, . Among the total number of state space bins, , are occupied by at least one experimental observation. The specific choices of the eigengenes to include in will determine the discretization of the gene-expression space, the fraction of the state space containing at least one observation, and the agreement of the growth-rate prediction for each experiment with the measured growth rate .
Choosing the parameter
In the Results, we briefly describe the limiting behavior for , which is chosen to balance the least squares term (first) with the state-space occupancy term (second). Here, we go into further detail regarding how to empirically select . We first note that as , the eigengenes that have a coarse-grained one-to-one correspondence (see Fig.Ā 1B of the main text) with growth rate are selected, but as the accuracy becomes the determinative factor. Therefore, there exist constants and such that for values of the order of selected eigengenes is the same, and likewise for .
Starting from Eq.Ā (4) of the main text, we can bound constants and as follows in terms of , , , , and the expected accuracy of the growth-rate estimation, . For simplicity, we approximate as a constant, as we observe in our datasets. We can rewrite the first term of Eq.Ā (4) as . Next, we solve for the value of for which the two terms are equal if and if which result in the lower bound for and upper bound for , respectively. The first result is that , and the second is that . Plugging in the values for the E. coli dataset (, , ), and assuming in the first case and, letting , . The range tested in Fig.Ā S1 extends slightly beyond [0.005,1], with the best fit value lying near the geometric mean.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1(1) Ravasz E, Somera AL, Mongru DA, Oltvai ZN, BarabĆ”si AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551ā1555.
- 2(2) Glass K, Ott E, Losert W, Girvan M (2012) Implications of functional similarity for gene regulatory interactions. J. R. Soc. Interface 9(72):1625ā1636.
- 3(3) Ku WL, Duggal G, Li Y, Girvan M, Ott E (2012) Interpreting patterns of gene expression: Signatures of coregulation, the data processing inequality, and triplet motifs. PLOS ONE 7(2):e 31969.
- 4(4) ZaƱudo JGT, Albert R (2015) Cell fate reprogramming by control of intracellular network dynamics. PLOS Comput. Biol. 11(4):e 1004193.
- 5(5) Cornelius SP, Kath WL, Motter AE (2013) Realistic control of network dynamics. Nat. Commun.
- 6(6) Edwards JS, Palsson BĆ (2000) The Escherichia coli MG 1655 in silico metabolic genotype: Its definition, characteristics, and capabilities. Proc. Natl. Acad. Sci. USA 97(10):5528ā5533.
- 7(7) Fƶrster J, Famili I, Fu P, Palsson BĆ, Nielsen J (2003) Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. 13(2):244ā253.
- 8(8) SegrĆØ D, Vitkup D, Church GM (2002) Analysis of optimality in natural and perturbed metabolic networks. Proc. Natl. Acad. Sci. USA 99(23):15112ā15117.
