How the Avidity of Polymerase Binding to the -35/-10 Promoter Sites   Affects Gene Expression

Tal Einav; Rob Phillips

arXiv:1904.01847·q-bio.SC·October 12, 2022

How the Avidity of Polymerase Binding to the -35/-10 Promoter Sites Affects Gene Expression

Tal Einav, Rob Phillips

PDF

Open Access

TL;DR

This paper introduces a refined energy matrix model that incorporates avidity effects between -35 and -10 promoter sites, improving predictions of gene expression and revealing how multivalent binding influences transcription regulation.

Contribution

It demonstrates that avidity between promoter sites is essential for accurate gene expression modeling, challenging the assumption of independent additive effects in previous models.

Findings

01

Avidity significantly affects RNA polymerase binding and gene expression.

02

The refined model accurately predicts expression across diverse promoter sequences.

03

Multivalent binding buffers against mutations and can inhibit expression when too tight.

Abstract

Although the key promoter elements necessary to drive transcription in Escherichia coli have long been understood, we still cannot predict the behavior of arbitrary novel promoters, hampering our ability to characterize the myriad of sequenced regulatory architectures as well as to design novel synthetic circuits. This work builds upon a beautiful recent experiment by Urtecho et al. who measured the gene expression of over 10,000 promoters spanning all possible combinations of a small set of regulatory elements. Using this data, we demonstrate that a central claim in energy matrix models of gene expression - that each promoter element contributes independently and additively to gene expression - contradicts experimental measurements. We propose that a key missing ingredient from such models is the avidity between the -35 and -10 RNA polymerase binding sites and develop what we call a…

Equations14

GE \propto e^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})} .

GE \propto e^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})} .

GE = \frac{r _{0} + r _{max} e ^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})}}{1 + e ^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})}} .

GE = \frac{r _{0} + r _{max} e ^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})}}{1 + e ^{- β (E_{BG} + E_{Spacer} + E_{\mathchar 4535} + E_{\mathchar 4510})}} .

GE = \frac{r _{0} + e ^{- β (E_{BG} + E_{Spacer})} ( r _{0} e ^{- β E_{\mathchar 4535}} + r _{0} e ^{- β E_{\mathchar 4510}} + r _{max} e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}{1 + e ^{- β (E_{BG} + E_{Spacer})} ( e ^{- β E_{\mathchar 4535}} + e ^{- β E_{\mathchar 4510}} + e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )} .

GE = \frac{r _{0} + e ^{- β (E_{BG} + E_{Spacer})} ( r _{0} e ^{- β E_{\mathchar 4535}} + r _{0} e ^{- β E_{\mathchar 4510}} + r _{max} e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}{1 + e ^{- β (E_{BG} + E_{Spacer})} ( e ^{- β E_{\mathchar 4535}} + e ^{- β E_{\mathchar 4510}} + e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )} .

GE^{(0, 0)} = GE^{(1, 1)} \frac{GE ^{(0, 1)}}{GE ^{(1, 1)}} \frac{GE ^{(1, 0)}}{GE ^{(1, 1)}} .

GE^{(0, 0)} = GE^{(1, 1)} \frac{GE ^{(0, 1)}}{GE ^{(1, 1)}} \frac{GE ^{(1, 0)}}{GE ^{(1, 1)}} .

GE = \frac{r _{0} + e ^{- β (E_{BG} + E_{Spacer} + E_{UP})} ( r _{0} e ^{- β E_{\mathchar 4535}} + r _{0} e ^{- β E_{\mathchar 4510}} + r _{max} e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}{1 + e ^{- β (E_{BG} + E_{Spacer} + E_{UP})} ( e ^{- β E_{\mathchar 4535}} + e ^{- β E_{\mathchar 4510}} + e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}

GE = \frac{r _{0} + e ^{- β (E_{BG} + E_{Spacer} + E_{UP})} ( r _{0} e ^{- β E_{\mathchar 4535}} + r _{0} e ^{- β E_{\mathchar 4510}} + r _{max} e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}{1 + e ^{- β (E_{BG} + E_{Spacer} + E_{UP})} ( e ^{- β E_{\mathchar 4535}} + e ^{- β E_{\mathchar 4510}} + e ^{- β (E_{\mathchar 4535} + E_{\mathchar 4510} + E_{int})} )}

\frac{r _{max} + r _{0} e ^{- β (Δ E_{RNAP} - Δ E_{trans})}}{1 + e ^{- β (Δ E_{RNAP} - Δ E_{trans})}} .

\frac{r _{max} + r _{0} e ^{- β (Δ E_{RNAP} - Δ E_{trans})}}{1 + e ^{- β (Δ E_{RNAP} - Δ E_{trans})}} .

K_{D}^{eff} = \frac{K _{\mathchar 4535} K _{\mathchar 4510}}{c _{0} + K _{\mathchar 4535} + K _{\mathchar 4510}}

K_{D}^{eff} = \frac{K _{\mathchar 4535} K _{\mathchar 4510}}{c _{0} + K _{\mathchar 4535} + K _{\mathchar 4510}}

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsBacterial Genetics and Biotechnology · RNA and protein synthesis mechanisms · Genomics and Chromatin Dynamics

Full text

How the Avidity of Polymerase Binding to the -35/-10 Promoter Sites Affects Gene Expression

Tal Einav1,∗, Rob Phillips1,2,3,∗

1Department of Physics, California Institute of Technology, Pasadena, CA, 91125, USA

2Department of Applied Physics, California Institute of Technology, Pasadena, CA, 91125, USA

3Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, 91125, USA

*Corresponding author: [email protected]; (512) 468-1994; 1200 E. California Blvd, MC 103-33, Pasadena, CA 91125 *Corresponding author: [email protected]; (626) 395-3374; 1200 E. California Blvd, MC 128-95, Pasadena, CA 91125

Abstract

Although the key promoter elements necessary to drive transcription in Escherichia coli have long been understood, we still cannot predict the behavior of arbitrary novel promoters, hampering our ability to characterize the myriad of sequenced regulatory architectures as well as to design new synthetic circuits. This work builds on a beautiful recent experiment by Urtecho et al. who measured the gene expression of over 10,000 promoters spanning all possible combinations of a small set of regulatory elements. Using this data, we demonstrate that a central claim in energy matrix models of gene expression – that each promoter element contributes independently and additively to gene expression – contradicts experimental measurements. We propose that a key missing ingredient from such models is the avidity between the -35 and -10 RNA polymerase binding sites and develop what we call a refined energy matrix model that incorporates this effect. We show that this the refined energy matrix model can characterize the full suite of gene expression data and explore several applications of this framework, namely, how multivalent binding at the -35 and -10 sites can buffer RNAP kinetics against mutations and how promoters that bind overly tightly to RNA polymerase can inhibit gene expression. The success of our approach suggests that avidity represents a key physical principle governing the interaction of RNA polymerase to its promoter.

Significance Statement

Cellular behavior is ultimately governed by the genetic program encoded in its DNA and through the arsenal of molecular machines that actively transcribe its genes, yet we lack the ability to predict how an arbitrary DNA sequence will perform. To that end, we analyze the performance of over 10,000 regulatory sequences and develop a model that can predict the behavior of any sequence based on its composition. By considering promoters that only vary by one or two elements, we can characterize how different components interact, providing fundamental insights into the mechanisms of transcription.

Introduction

Promoters modulate the complex interplay of RNA polymerase (RNAP) and transcription factor binding that ultimately regulates gene expression. While our knowledge of the molecular players that mediate these processes constantly improves, more than half of all promoters in Escherichia coli still have no annotated transcription factors in RegulonDB [1] and our ability to design novel promoters that elicit a target level of gene expression remains limited.

As a step towards taming the vastness and complexity of sequence space, the recent development of massively parallel reporter assays has enabled entire libraries of promoter mutants to be simultaneously measured [2, 3, 4]. Given this surge in experimental prowess, the time is ripe to reexamine how well our models of gene expression can extrapolate the response of a general promoter.

A common approach to quantifying gene expression, called the energy matrix model, assumes that every promoter element contributes additively and independently to the total RNAP (or transcription factor) binding energy [3]. This model treats all base pairs on an equal footing and does not incorporate mechanistic details of RNAP-promoter interactions such as its strong binding primarily at the -35 and -10 binding motifs (shown in Fig. 1A). A newer method recently took the opposite viewpoint, designing an RNAP energy matrix that only includes the -35 element, -10 element, and the length of the spacer separating them [5], neglecting the sequence composition of the spacer or the surrounding promoter region.

Although these methods have been successfully used to identify important regulatory elements in unannotated promoters [6] and predict evolutionary trajectories [5], it is clear that there is more to the story. Even in the simple case of the highly-studied lac promoter, such energy matrices show systematic deviations from measured levels of gene expressions, indicating that some fundamental component of transcriptional regulation is still missing [7].

We propose that one failure of current models lies in their tacit assumption that every promoter element contributes independently to the RNAP binding energy. By naturally relaxing this assumption to include the important effects of avidity, we can push beyond the traditional energy matrix analysis in several key ways including: (i) We can identify which promoter elements contribute independently or cooperatively without recourse to fitting, thereby building an unbiased mechanistic model for systems that bind at multiple sites. (ii) Applying this approach to RNAP-promoter binding reveals that the -35 and -10 motifs bind cooperatively, a feature that we attribute to avidity. Moreover, we show that models that instead assume the -35 and -10 elements contribute additively and independently sharply contradict the available data. (iii) We show that the remaining promoter elements (the spacer, UP, and background shown in Fig. 1A) do contribute independently and additively to the RNAP binding energy and formulate the corresponding model for transcriptional regulation that we call a refined energy matrix model. (iv) We use this model to explore how the interactions between the -35 and -10 elements can buffer RNAP kinetics against mutations. (v) We analyze a surprising feature of the data where overly-tight RNAP-promoter binding can lead to decreased gene expression. (vi) We validate our model by analyzing the gene expression of over 10,000 promoters in E. coli recently published by Urtecho et al. [8] and demonstrate that our framework markedly improves upon the traditional energy matrix analysis.

While this work focuses on RNAP-promoter binding, its implications extend to general regulatory architectures involving multiple tight-binding elements including transcriptional activators that make contact with RNAP (CRP in the lac promoter [9]), transcription factors that oligomerize (as recently identified for the xylE promoter [6]), and transcription factors that bind to multiple sites on the promoter (DNA looping mediated by the Lac repressor [10]). More generally, this approach of categorizing which binding elements behave independently (without resorting to fitting) can be applied to multivalent interactions in other biological contexts including novel materials, scaffolds, and synthetic switches [11, 12].

Results

The -35 and -10 Binding Sites give rise to Gene Expression that Defies Characterization as Independent and Additive Components

Decades of research have shed light upon the exquisite biomolecular details involved in bacterial transcriptional regulation via the family of RNAP $\sigma$ factors [13]. In this work, we restrict our attention to the $\sigma^{70}$ holoenzyme [8], the most active form under standard E. coli growth condition, whose interaction with a promoter includes direct contact with the -35 and -10 motifs (two hexamers centered roughly 10 and 35 bases upstream of the transcription start site), a spacer region separating these two motifs, an UP element just upstream of the -35 motif that anchors the C-terminal domain ( $\alpha$ CTD) of RNAP, and the background promoter sequence surrounding these elements.

Urtecho et al. constructed a library of promoters composed of every combination of eight -35 motifs, eight -10 motifs, eight spacers, eight backgrounds (BG), and three UP elements (Fig. 1A) [8]. Each sequence was integrated at the same locus within the E. coli genome and transcription was quantified via DNA barcoding and RNA sequencing. One of the three UP elements considered was the absence of an UP binding motif, and this case will serve as the starting point for our analysis.

The traditional energy matrix approach used by Urtecho et al. posits that every base pair of the promoter will contribute additively and independently to the RNAP binding energy [8], which by appropriately grouping base pairs is equivalent to stating that the free energy of RNAP binding will be the sum of its contributions from the background, spacer, -35, and -10 elements (see Appendix A). Hence, the gene expression (GE) is given by the Boltzmann factor

[TABLE]

Note that all $E_{j}$ s represent free energies (with an energetic and entropic component); to see the explicit dependence on RNAP copy number, refer to Appendix A. Fitting the 32 free energies (one for each background, spacer, -35, and -10 element) and the constant of proportionality in Eq. 1 enables us to predict the expression of $8\times 8\times 8\times 8=4{,}096$ promoters.

Fig. 2A demonstrates that Eq. 1 leads to a poor characterization of these promoters ( $R^{2}=0.57$ , parameter values listed in Appendix B), suggesting that critical features of gene expression are missing from this model. One possible resolution is to assumes that the level of gene expression saturates for very strong promoters at $r_{\text{max}}$ and for very weak promoters at $r_{0}$ (caused by background noise or serendipitous near-consensus sequences [5]), namely,

[TABLE]

Since Eq. 2 still assumes that each promoter element contributes additively and independently to the total RNAP binding energy, it also makes sharp predictions that markedly disagrees with the data (see Appendix C). Inspired by these inconsistencies, we postulated that certain promoter elements, most likely the -35 and -10 sites, may not contribute synergistically to RNAP binding.

To that end, we consider a model for gene expression shown in Fig. 1B where RNAP can separately bind to the -35 and -10 sites. RNAP is assumed to elicit a large level of gene expression $r_{\text{max}}$ when fully bound but the smaller level $r_{0}$ when unbound or partially bound. Importantly, the Boltzmann weight of the fully bound state contains the free energy $E_{\text{int}}$ representing the avidity of RNAP binding to the -35 and -10 sites. Physically, avidity arises because unbound RNAP binding to either the -35 or -10 sites gains energy but loses entropy, while this singly bound RNAP attaching at the other (-10 or -35) site again gains energy but loses much less entropy, as it was tethered in place rather than floating in solution. Hence we expect $e^{-\beta E_{\text{int}}}\gg 1$ , and including this avidity term implies that RNAP no longer binds independently to the -35 and -10 sites.

Our coarse-grained model of gene expression neglects the kinetic details of transcription whereby RNAP transitions from the closed to open complex before initiating transcription. Instead, we assume that there is a separation of timescales between the fast process of RNAP binding/unbinding to the promoter and the other processes that constitute transcription. In the quasi-equilibrium framework shown in Fig. 1B, gene expression is given by the average occupancy of RNAP in each of its states, namely,

[TABLE]

We call this expression a refined energy matrix model since it reduces to the energy matrix Eq. 1 (with constant of proportionality $r_{\text{max}}E^{-\beta E_{\text{int}}}$ ) in the limit where gene expression is negligible when the RNAP is not bound ( $r_{0}\approx 0$ ) and the promoter is sufficiently weak or the RNAP concentration is sufficiently small that polymerase is most often in the unbound state (so that the denominator $\approx 1$ ). The background and spacer are assumed to contribute to RNAP binding in both the partially and fully bound states, an assumption that we rigorously justify in Appendix D.

Fig. 2B demonstrates that the refined energy matrix Eq. 3 is better able to capture the system’s behavior ( $R^{2}=0.91$ ) while only requiring two more parameters ( $r_{0}$ and $E_{\text{int}}$ ) than the energy matrix model Eq. 1. The sharp boundaries on the left and right represent the minimum and maximum levels of gene expression, $r_{0}=0.18$ and $r_{\text{max}}=8.6$ , respectively (see Appendix E). The refined energy matrix predicts that the top 5% of promoters will exhibit expression levels of 7.6 (compared to 8.5 measured experimentally) while the weakest 5% of promoters should express at 0.2 (compared to the experimentally measured 0.1). In addition, this model quickly gains predictive power, as its coefficient of determination only slightly diminishes ( $R^{2}=0.86$ ) if the model is trained on only 10% of the data and used to predict the remaining 90%.

Epistasis-Free Models of Gene Expression Lead to Sharp Predictions that Disagree with the Data

To further validate that the lower coefficient of determination of the energy matrix approach (Eq. 1) was not an artifact of the fitting, we can utilize the epistasis-free nature of this model to predict the gene expression of double mutants from that of single mutants. More precisely, denote the gene expression $\text{GE}^{(0,0)}$ of a promoter with the consensus -35 and -10 sequences (and any background or spacer sequence). Let $\text{GE}^{(1,0)}$ , $\text{GE}^{(0,1)}$ , and $\text{GE}^{(1,1)}$ represent promoters (with this same background and spacer) whose -35/-10 sequences are mutated/consensus, consensus/mutated, and mutated/mutated, respectively, where “mutated” stands for any non-consensus sequence. As derived in Appendix D, the gene expression of these three later sequences can predict the gene expression of the promoter with the consensus -35 and -10 without recourse to fitting, namely,

[TABLE]

The inset in Fig. 2A compares the epistasis-free predictions ( $x$ -axis, right-hand side of Eq. 4) with the measured gene expression ( $y$ -axis, left-hand side of Eq. 4). These results demonstrate that the simple energy matrix formulation fails to capture the interaction between the -35 and -10 binding sites. While this calculation cannot readily generalize to the refined energy matrix model since it exhibits epistasis, it is analytically tractable for weak promoters where the refined energy matrix model displays a marked improvement over the traditional energy matrix model (see Appendix C).

RNAP Binding to the UP Element occurs Independently of the Other Promoter Elements

Having seen that the refined energy matrix model (Eq. 3) can outperform the traditional energy matrix analysis on promoters with no UP element, we next extend the former model to promoters containing an UP element. Given the importance of the RNAP interactions with the -35 and -10 sites seen above, Fig. 3A shows three possible mechanisms for how the UP element could mediate RNAP binding. For example, the C-terminal could bind strongly and independently so that RNAP has three distinct binding sites. Another possibility is that the RNAP $\alpha$ CTD binds if and only if the -35 binding site is bound. A third alternative is that the UP element contributes additively and independently to RNAP binding (analogous to the spacer and background).

To distinguish between these possibilities, we analyze the correlations in gene expression between every pair of promoter elements (UP and -35, spacer and background, etc.) to determine the strength of their interaction. Each model in Fig. 3A will have a different signature: The top schematic predicts strong interactions between the -35 and -10, between the UP and -35, and between the UP and -10; the middle schematic would give rise to strong dependence between the -35 and -10 as well as between the UP and -10, while the UP and -35 elements would be perfectly correlated; the bottom schematic suggests that the UP elements will contribute independently of the other promoter elements.

This analysis, which we relegate to Appendix D, demonstrates that the UP element is approximately independent of all other promoter elements ( $R^{2}\gtrsim 0.6$ ) as are the background and spacer, indicating that the bottom schematic in Fig. 3A characterizes the binding of the UP element. This leads us to the general form of transcriptional regulation by RNAP, shown in Eq. 5.

[TABLE]

Fig. 3B demonstrates how the expression of all promoters containing one of the two UP elements combined with each of the eight background, spacer, -35, and -10 sequences ( $2\times 8^{4}=8{,}192$ promoters) closely matches the model predictions ( $R^{2}=0.88$ ). Remarkably, since we used the same free energies and gene expression rates from Fig. 2B, characterizing these 8,192 promoters only required two additional parameters (the free energies of the two UP elements). This result emphasizes how understanding each modular component of gene expression can enable us to harness the combinatorial complexity of sequence space.

Sufficiently Strong RNAP-Promoter Binding Energy can Decrease Gene Expression

Although the 12,288 promoters considered above are well characterized by Eq. 5 on average, the data demonstrate that the full mechanistic picture is more nuanced. For example, Urtecho et al. found that gene expression (averaged over all backgrounds and spacers) generally increases for -35/-10 elements closer to the consensus sequences [8]. In terms of the gene expression models studied above (Eqs. 1-3), promoters with fewer -35/-10 mutations have more negative free energies $E_{\mathchar 45\relax 35}$ and $E_{\mathchar 45\relax 10}$ leading to larger expression. Yet the strongest promoters with the consensus -35/-10 violated this trend, exhibiting less expression than promoters one mutation away. Thus, Urtecho et al. postulated that past a certain point, promoters that bind RNAP too tightly may inhibit transcription initiation and lead to decreased gene expression.

The promoters with a consensus -35/-10 are shown as red points in Fig. 3B, and indeed these promoters are all predicted to bind tightly to RNAP and hence express at the maximum level $r_{\text{max}}=8.6$ , placing them on the right-edge of the data. Yet depending on their UP, background, and spacer, many of these promoters exhibit significantly less gene expression then expected. Motivated by this trend, we posit that the state of transcription initiation can be characterized by a free energy $\Delta E_{\text{trans}}$ relative to unbound RNAP that competes with the free energy $\Delta E_{\text{RNAP}}$ between fully bound and unbound RNAP (see Appendix E).

Assuming the rate of transcription initiation is proportional to the relative Boltzmann weights of these two states, the level of gene expression $r_{\text{max}}$ in Eq. 5 will be modified to

[TABLE]

As expected, this expression reduces to $r_{\text{max}}$ for promoters that weakly bind RNAP ( $e^{-\beta(\Delta E_{\text{RNAP}}-\Delta E_{\text{trans}})}\ll 1$ ) but decreases for strong promoters until it reaches the background level $r_{0}$ when the promoter binds so tightly that RNAP is glued in place and unable to initiate transcription. Upon reanalyzing the gene expression data with the inferred value $\Delta E_{\text{trans}}=-6.2\,k_{B}T$ (see Appendix E), we can plot the measured level of gene expression against the predicted RNAP-promoter free energy $\Delta E_{\text{RNAP}}$ as shown in Fig. 4 (stronger promoters to the right). We find that this revised model captures the downwards trend in gene expression observed for the strongest promoters, most of which contain a consensus -35/-10.

The Bivalent Binding of RNAP Buffers its Binding Behavior against Promoter Mutations

In this final section, we investigate how the avidity between the -35 and -10 sites changes the dynamics of RNAP binding. More specifically, we consider the effective dissociation constant governing RNAP binding when both the -35 and -10 sites are intact and compare it to the case where only one site is capable of binding. To simplify this discussion, we focus exclusively on the case of RNAP binding to the -35 and -10 motifs as shown in the rates diagram Fig. 1C, absorbing the effects of the background, spacer, and UP elements into these rates.

At equilibrium, there is no flux between the four RNAP states. We define the effective dissociation constant

[TABLE]

which represents the concentration of RNAP at which there is a 50% likelihood that the promoter is bound (see Appendix F). $K_{j}=\frac{k_{\text{off},j}}{k_{\text{on}}}$ stands for the dissociation constant of free RNAP binding to the site $j$ and $c_{0}=\frac{\tilde{k}_{\text{on}}}{k_{\text{on}}}=[\text{RNAP}]e^{-\beta E_{\text{int}}}$ represents the increased local concentration of singly bound RNAP transitioning to the fully bound state (i.e., $E_{\text{int}}$ and $c_{0}$ are the embodiments of avidity in the language of statistical mechanics and thermodynamics, respectively). Note that $K_{D}^{\text{eff}}$ is a sigmoidal function of $K_{\mathchar 45\relax 10}$ with height $K_{\mathchar 45\relax 35}$ and midpoint at $K_{\mathchar 45\relax 10}=c_{0}+K_{\mathchar 45\relax 35}$ .

Fig. 5 demonstrates how the effective RNAP dissociation constant $K_{D}^{\text{eff}}$ changes when mutations to the -10 binding motif alter its dissociation constant $K_{\mathchar 45\relax 10}$ . When the -35 sequence is weak (dashed lines, $k_{\text{off},\mathchar 45\relax 35}\to\infty$ ), $K_{D}^{\text{eff}}\approx K_{\mathchar 45\relax 10}$ signifying that RNAP binding relies solely on the strength of the -10 site. In the opposite limit where RNAP tightly binds to the -35 sequence (solid lines), the cooperativity $c_{0}$ and the dissociation constant $K_{\mathchar 45\relax 35}$ shift the curve horizontally and bound the effective dissociation constant to $K_{D}^{\text{eff}}\leq K_{\mathchar 45\relax 35}$ . This upper bound may buffer promoters against mutations, since achieving a larger effective dissociation constant would require not only wiping out the -35 site but in addition mutating the -10 site. Finally, in the case where the cooperativity $c_{0}$ is large, $K_{D}^{\text{eff}}\approx\frac{K_{\mathchar 45\relax 10}K_{\mathchar 45\relax 35}}{c_{0}}$ indicating that as soon as one site of the RNAP binds, the other is very likely to also bind, thereby giving rise to the multiplicative dependence on the two $K_{D}$ s.

To get a sense for how these numbers translate into physiological RNAP dwell times on the promoter, we note that the lifetime of bound RNAP is given by $\tau=\frac{1}{K_{D}^{\text{eff}}k_{\text{on}}}$ (see Appendix F). Using $K_{D}^{\text{eff}}\approx 10^{-9}\,\text{M}$ for the strong T7 promoter [14] and assuming a diffusion-limited on-rate $10^{8}\frac{1}{\text{M}\cdot\text{s}}$ leads to a dwell time of $10\,\text{s}$ , comparable to the measured dwell time of RNAP-promoter in the closed complex [15]. It would be fascinating if recently developed methods that visualize real-time single-RNAP binding events probed the dwell time of the promoter constructed by Urtecho et al. to see how well the predictions of the refined energy matrix model match experimental measurements [15].

Discussion

While high-throughput methods have enabled us to measure the gene expression of tens of thousands of promoters, they nevertheless only scratch the surface of the full sequence space. A typical promoter composed of 200 bp has $4^{200}$ variants (more than the number of atoms in the universe). Nevertheless, by understanding the principles governing transcriptional regulation, we can begin to cut away at this daunting complexity to design better promoters.

In this work, we analyzed a recent experiment by Urtecho et al. measuring gene expression of over 10,000 promoters in E. coli using the $\sigma^{70}$ RNAP holoenzyme [8]. These sequences comprised all combinations of a small set of promoter elements, namely, eight -10s, eight -35s, eight spacers, eight backgrounds, and three UPs depicted in Fig. 1A, providing an opportunity to deepen our understanding of how these elements interact and to compare different quantitative models of gene expression.

We first analyzed this data using classic energy matrix models which posit that each promoter element contributes independently to the RNAP-promoter binding energy. As emphasized by Urtecho et al. and other groups, such energy matrices poorly characterize gene expression (Fig. 2A, $R^{2}=0.57$ ) and offer testable predictions that do not match the data (Appendix C), mandating the need for other approaches [8, 7].

To meet this challenge, we first determined which promoter elements contribute independently to RNAP binding (Appendix D). This process, which was done without recourse to fitting, demonstrated that the -35 and -10 elements bind in a concerted manner that we postulated is caused by avidity. In this context, avidity implies that when RNAP is singly bound to either the -35 or -10 sites, it is much more likely (compared to unbound RNAP) to bind to the other site, similar to the boost in binding seen in bivalent antibodies [16] or multivalent systems [17, 18, 12]. Surprisingly, we found that outside the -35/-10 pair, the other components of the promoter contributed independently to RNAP binding.

Using these findings, we developed a refined energy matrix model of gene expression (Eq. 5) that incorporates the avidity of between the -35/-10 sites as well as the independence of the UP/spacer/background interactions. This model was able to characterize the 4,096 promoters with no UP element (Fig. 2B, $R^{2}=0.91$ ) and the 8,192 promoters containing an UP element (Fig. 3B, $R^{2}=0.88$ ). These results surpass those of the traditional energy matrix model (Fig. 2A, $R^{2}=0.57$ ), only requiring two additional parameters that could be experimentally determined (e.g., the interaction energy $E_{\text{int}}$ arising from the -35/-10 avidity and the level of gene expression $r_{0}$ of a promoter with a scrambled -10 motif, a scrambled -35 motif, or with both motifs scrambled).

These promising findings suggest that determining which components bind independently is crucial to properly characterize multivalent systems. It would be fascinating to extend this study to RNAP with other $\sigma$ factors [13] as well as to RNAP mutants with no $\alpha$ CTD or that do not bind at the -35 site [19, 20]. Our model would predict that polymerases in this last category with at most one strong binding site should conform to a traditional energy matrix approach.

Quantitative frameworks such as the refined energy matrix model explored here can deepen our understanding of the underlying mechanisms governing a system’s behavior. For example, while searching for systematic discrepancies between our model prediction and the gene expression measurements, we found that promoters predicted to have the strongest RNAP affinity did not exhibit the largest levels of gene expression (thus violating a core assumption of nearly all models of gene expression that we know of). This led us to posit a characteristic energy for transcription initiation that reduces the expression of overly strong promoters (Fig. 4). In addition, we explored how having separate binding sites at the -35 and -10 elements buffers RNAP kinetics against mutations; for example, no single mutation can completely eliminate gene expression of a strong promoter with the consensus -35 and -10 sequence, since at least one mutation in both the -35 and -10 motifs would be needed (Fig. 5).

Finally, we end by zooming out from the particular context of transcription regulation and note that multivalent interactions are prevalent in all fields of biology [21], and our work suggests that differentiating between independent and dependent interactions may be key to not only characterizing overall binding affinities but to also understand the dynamics of a system [22]. Such formulations may be essential when dissecting the much more complicated interactions in eukaryotic transcription where large complexes bind at multiple DNA loci [23, 24] and more broadly in multivalent scaffolds and materials [11, 12].

Methods

We trained both the standard and refined energy matrix models on 75% of the data and characterized the predictive power on the remaining 25%, repeating the procedure 10 times. The coefficient of determination $R^{2}$ was calculated for $y_{\text{data}}=\log_{10}(\text{gene expression})$ to prevent the largest gene expression values from dominating the result. The supplementary Mathematica notebook contains the data analyzed in this work and can recreate all plots.

Acknowledgements

We thank Suzy Beeler, Vahe Galstyan, Peng (Brian) He, and Zofii Kaczmarek for helpful discussions. This work was supported by the Rosen Center at Caltech and the National Institutes of Health through 1R35 GM118043-01 (MIRA).

Bibliography24

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1[1] Socorro Gama-Castro, Heladia Salgado, Alberto Santos-Zavaleta, Daniela Ledezma-Tejeida, Luis Muniz-Rascado, Jair Santiago Garcia-Sotelo, Kevin Alquicira-Hernandez, Irma Martinez-Flores, Lucia Pannier, Jaime Abraham Castro-Mondragon, Alejandra Medina-Rivera, Hilda Solano-Lira, Cesar Bonavides-Martinez, Ernesto Perez-Rueda, Shirley Alquicira-Hernandez, Liliana Porron-Sotelo, Alejandra Lopez-Fuentes, Anastasia Hernandez-Koutoucheva, Victor Del Moral-Chavez, Fabio Rinaldi and Julio Collado-Vid · doi ↗
2[2] Rupali P Patwardhan, Choli Lee, Oren Litvin, David L Young, Dana Pe’er and Jay Shendure “High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis” In Nature Biotechnology 27.12 Nature Publishing Group, 2009, pp. 1173–1175 DOI: 10.1038/nbt.1589 · doi ↗
3[3] J. B. Kinney, A. Murugan, C. G. Callan and E. C. Cox “Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence” In Proceedings of the National Academy of Sciences 107.20 , 2010, pp. 9158–9163 DOI: 10.1073/pnas.1004290107 · doi ↗
4[4] Fumitaka Inoue and Nadav Ahituv “Decoding enhancers using massively parallel reporter assays” In Genomics 106.3 Academic Press, 2015, pp. 159–164 DOI: 10.1016/J.YGENO.2015.06.005 · doi ↗
5[5] Avihu H. Yona, Eric J. Alm and Jeff Gore “Random sequences rapidly evolve into de novo promoters” In Nature Communications 9.1 Springer US, 2018, pp. 1530 DOI: 10.1038/s 41467-018-04026-w · doi ↗
6[6] Nathan M. Belliveau, Stephanie L. Barnes, William T. Ireland, Daniel L. Jones, Michael J. Sweredoski, Annie Moradian, Sonja Hess, Justin B. Kinney and Rob Phillips “Systematic approach for dissecting the molecular mechanisms of transcriptional regulation in bacteria” In Proceedings of the National Academy of Sciences , 2018 DOI: 10.1073/pnas.1722055115 · doi ↗
7[7] Talitha Forcier, Andalus Ayaz, Manraj Gill and Justin B Kinney “Precision measurements of regulatory energetics in living cells” In bio Rxiv , 2018
8[8] Guillaume Urtecho, Arielle D Tripp, Kimberly Insigne, Hwangbeom Kim and Sriram Kosuri “Systematic Dissection of Sequence Elements Controlling σ 𝜎 \sigma 70 Promoters Using a Genomically-Encoded Multiplexed Reporter Assay in E. coli” In Biochemistry , 2018, pp. acs.biochem.7b 01069 DOI: 10.1021/acs.biochem.7b 01069 · doi ↗