Towards a theory of assembly of protein complexes: lessons from   equilibrium statistical physics

Pablo Sartori; Stanislas Leibler

arXiv:1907.02992·physics.bio-ph·July 9, 2019

Towards a theory of assembly of protein complexes: lessons from equilibrium statistical physics

Pablo Sartori, Stanislas Leibler

PDF

Open Access

TL;DR

This paper develops an equilibrium thermodynamic model to understand how cells reliably assemble diverse protein complexes despite physical and cellular noise, identifying key conditions for error-free assembly.

Contribution

It introduces a theoretical framework with four assembly regimes and identifies conditions for error-free complex formation, supported by analysis of real protein data.

Findings

01

Four distinct assembly behaviors identified

02

Heterogeneity and sparsity are key for error-free assembly

03

Cellular protein systems likely evolved to meet these conditions

Abstract

Cellular functions are established through biological evolution, but are constrained by the laws of physics. For instance, the physics of protein folding limits the lengths of cellular polypeptide chains. Consequently, many cellular functions are carried out not by long, isolated proteins, but rather by multi-protein complexes. Protein complexes themselves do not escape physical constraints, one of the most important being the difficulty to assemble reliably in the presence of cellular noise. In order to lay the foundation for a theory of reliable protein complex assembly, we study here an equilibrium thermodynamic model of self-assembly that exhibits four distinct assembly behaviors: diluted protein solution, liquid mixture, "chimeric assembly" and "multifarious assembly". In the latter regime, different protein complexes can coexist without forming erroneous chimeric structures. We…

Tables1

Table 1. Table 1: Summary of datasets used in this work. Datasets I − V 𝐼 𝑉 I-V involved some degree of post-processing, as detailed in the text. For dataset V 𝑉 V we only used the “core” dataset, which excludes redundancies. As described in Wan et al. ( 2015 ) , dataset V I I 𝑉 𝐼 𝐼 VII is based on human cell data, but interactions among proteins were added provided they were confirmed for at least two other species among the five species studied.

dataset	source	method	organism	stoichiometry	symmetry	$K$	$N_{tot}$
$I$	Meldal et al. (2014)	manual	S. cerevisiae	Yes	No	$238$	$746$
$I I$	Meldal et al. (2014)	manual	S. cerevisiae	Yes	Yes	$59$	$173$
$I I I$	Meldal et al. (2014)	manual	S. cerevisiae	No	No	$589$	$1895$
$I V$	Krogan et al. (2006)	high-througput	S. cerevisiae	No	No	$664$	$2133$
$V$	Hart et al. (2007)	meta-analysis	S. cerevisiae	No	No	$518$	$3457$
$V I$	Giurgiu et al. (2018)	manual	Multi-organism	No	No	$2652$	$4625$
$V I I$	Wan et al. (2015)	high-throughput	Multi-organism	No	No	$982$	$2154$
$V I I I$	Drew et al. (2017)	meta-analysis	Human	No	No	$4659$	$7778$

Equations41

p_{max} / p_{min} \approx exp (E + μ_{0}),

p_{max} / p_{min} \approx exp (E + μ_{0}),

π_{α} \sim K (1 - a_{c}) / h_{c},

π_{α} \sim K (1 - a_{c}) / h_{c},

K ≲ h_{c}^{3/2} M_{c}^{1/2} (1 - a_{c})^{- 3/2},

K ≲ h_{c}^{3/2} M_{c}^{1/2} (1 - a_{c})^{- 3/2},

K \sim N_{tot}^{3/2} M_{c}^{- 1} .

K \sim N_{tot}^{3/2} M_{c}^{- 1} .

U_{α β δ} = - E Θ (c = 1 \sum K B_{α β δ}^{c}),

U_{α β δ} = - E Θ (c = 1 \sum K B_{α β δ}^{c}),

U = ⟨ i, j ⟩ \sum α, β \sum U_{α β δ (i, j)} σ_{α i} σ_{β j} - i \sum α \sum μ_{α} σ_{α i},

U = ⟨ i, j ⟩ \sum α, β \sum U_{α β δ (i, j)} σ_{α i} σ_{β j} - i \sum α \sum μ_{α} σ_{α i},

Err_{sat} = 1 - γ^{2} \frac{M}{L} .

Err_{sat} = 1 - γ^{2} \frac{M}{L} .

E = - μ /2,

E = - μ /2,

W_{chim} \approx (4 M \frac{K}{M} \frac{exp ( E + μ )}{1 + exp ( E + μ )}) (2 \frac{K ^{2}}{M} \frac{exp ( 2 E + μ )}{1 + exp ( 2 E + μ )}) .

W_{chim} \approx (4 M \frac{K}{M} \frac{exp ( E + μ )}{1 + exp ( E + μ )}) (2 \frac{K ^{2}}{M} \frac{exp ( 2 E + μ )}{1 + exp ( 2 E + μ )}) .

E = - μ + μ_{0},

E = - μ + μ_{0},

P_{no int} \approx (1 - 1/ N)^{M / N} = (1 - ϵ / h)^{1/ h} \approx 1 - ϵ / h^{2} + O (ϵ^{2}) .

P_{no int} \approx (1 - 1/ N)^{M / N} = (1 - ϵ / h)^{1/ h} \approx 1 - ϵ / h^{2} + O (ϵ^{2}) .

h ≳ h_{0} \approx M^{- 1/2}

h ≳ h_{0} \approx M^{- 1/2}

π \approx K (1 - a) / h,

π \approx K (1 - a) / h,

P_{\cap} \approx 1 - (1 - \frac{π}{N _{tot}})^{π} = 1 - (1 - ϵ \frac{π ( 1 - a )}{h})^{π} = ϵ \frac{π ^{2} ( 1 - a )}{h} + O (ϵ^{2}) = ϵ \frac{K ^{2} ( 1 - a ) ^{3}}{h ^{3}} + O (ϵ^{2}),

P_{\cap} \approx 1 - (1 - \frac{π}{N _{tot}})^{π} = 1 - (1 - ϵ \frac{π ( 1 - a )}{h})^{π} = ϵ \frac{π ^{2} ( 1 - a )}{h} + O (ϵ^{2}) = ϵ \frac{K ^{2} ( 1 - a ) ^{3}}{h ^{3}} + O (ϵ^{2}),

K^{2} ≲ \frac{h ^{3} M}{( 1 - a ) ^{3}} .

K^{2} ≲ \frac{h ^{3} M}{( 1 - a ) ^{3}} .

P_{\cap} \approx 1 - (1 - \frac{π ^{z /2 - 1}}{N _{tot}^{z /2 - 1}})^{π} = ϵ^{z /2 - 1} \frac{π ^{z /2} ( 1 - a ) ^{z /2 - 1}}{h ^{z /2 - 1}} + O (ϵ^{z - 2}) = ϵ^{z /2 - 1} \frac{K ^{z /2} ( 1 - a ) ^{z - 1}}{h ^{z - 1}} + O (ϵ^{z - 2}),

P_{\cap} \approx 1 - (1 - \frac{π ^{z /2 - 1}}{N _{tot}^{z /2 - 1}})^{π} = ϵ^{z /2 - 1} \frac{π ^{z /2} ( 1 - a ) ^{z /2 - 1}}{h ^{z /2 - 1}} + O (ϵ^{z - 2}) = ϵ^{z /2 - 1} \frac{K ^{z /2} ( 1 - a ) ^{z - 1}}{h ^{z - 1}} + O (ϵ^{z - 2}),

K^{z /2} ≲ \frac{h ^{z - 1}}{( 1 - a ) ^{z - 1}} M^{z /2 - 1} .

K^{z /2} ≲ \frac{h ^{z - 1}}{( 1 - a ) ^{z - 1}} M^{z /2 - 1} .

P_{\cap} \approx 1 - (1 - \frac{π ^{2}}{N _{tot}^{2}})^{π} = ϵ^{2} \frac{K ^{3} ( 1 - a ) ^{5}}{h ^{5}} + O (ϵ^{4}),

P_{\cap} \approx 1 - (1 - \frac{π ^{2}}{N _{tot}^{2}})^{π} = ϵ^{2} \frac{K ^{3} ( 1 - a ) ^{5}}{h ^{5}} + O (ϵ^{4}),

K^{3} ≲ \frac{h ^{5} M ^{2}}{( 1 - a ) ^{5}}

K^{3} ≲ \frac{h ^{5} M ^{2}}{( 1 - a ) ^{5}}

MA \to Ch :

MA \to Ch :

Ch \to L :

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsProtein Structure and Dynamics · Stochastic processes and statistical mechanics · Theoretical and Computational Physics

Full text

Towards a theory of assembly of protein complexes: lessons

from equilibrium statistical physics

Pablo Sartori

The Simons Center for Systems Biology, School of Natural Sciences, Institute for Advanced Study, Einstein Drive, Princeton, NJ 08540, U.S.A.

Center for Studies in Physics and Biology and Laboratory of Living Matter, The Rockefeller University, 1230 York Ave, New York, NY 01065, U.S.A.

Stanislas Leibler

The Simons Center for Systems Biology, School of Natural Sciences, Institute for Advanced Study, Einstein Drive, Princeton, NJ 08540, U.S.A.

Center for Studies in Physics and Biology and Laboratory of Living Matter, The Rockefeller University, 1230 York Ave, New York, NY 01065, U.S.A.

Abstract

Cellular functions are established through biological evolution, but are constrained by the laws of physics. For instance, the physics of protein folding limits the lengths of cellular polypeptide chains. Consequently, many cellular functions are carried out not by long, isolated proteins, but rather by multi-protein complexes. Protein complexes themselves do not escape physical constraints, one of the most important being the difficulty to assemble reliably in the presence of cellular noise. In order to lay the foundation for a theory of reliable protein complex assembly, we study here an equilibrium thermodynamic model of self-assembly that exhibits four distinct assembly behaviors: diluted protein solution, liquid mixture, “chimeric assembly” and “multifarious assembly”. In the latter regime, different protein complexes can coexist without forming erroneous chimeric structures. We show that two conditions have to be fulfilled to attain this regime: ( $i$ ) the composition of the complexes needs to be sufficiently heterogeneous, and ( $ii$ ) the use of the set of components by the complexes has to be sparse. Our analysis of publicly available databases of protein complexes indicates that cellular protein systems might have indeed evolved so to satisfy both of these conditions.

Introduction

Protein complexes are assembled with high compositional accuracy, evidenced e.g. by the possibility of crystalization of complexes as large as the ribosome Ban et al. (2000). This is remarkable, because during assembly a growing complex has to discriminate its specific components from a multicomponent mixture of hundreds of different protein species that are part of the proteome. Failure to solve this discriminatory task could result in assembly of chimeric structures composed of fragments from different complexes, impairing normal cellular function Garcia-Seisdedos et al. (2018).

Assembly of protein complexes can also be viewed as a second stage of creating functional cellular structures, the first being the assemblage of amino acids into proteins, achieved by ribosomes. A modest alphabet of twenty amino acids encodes thousands of different proteins. Proteins typically contain all twenty amino acids, so that the amino acid usage by proteins is “dense” rather than “sparse”. Nature, furthermore, reuses amino acids many times within the same protein, which makes the compositional heterogeneity of each protein low. This can be contrasted with the assembly of complexes, which seem to use proteins sparsely, so that each complex contains only a small fraction of the available proteome. At the same time, complexes are often highly heterogeneous, i.e. composed of many different protein species Reuveni et al. (2017).

The sparsity and heterogeneity of complexes should come as a surprise, as they imply that the proteome might not be exploited in combinatorial manner. Indeed, the vast repertoire of hundreds of proteins is combined to result in a comparable number of complexes, see Fig. 1. This suggests that “combinatorial expansion” of proteins into complexes does not occur generically, and may instead be restricted to particular functions, such as regulation or signaling Stein et al. (2009); Antebi et al. (2017). In these cases, proteins participate in several complexes, e.g. cyclin-dependent kinases can be part of several cell cycle regulatory complexes Murray (2004). Proteins can have specific interactions with many partners, a phenomenon known as promiscuity. The promiscuity of proteins may potentially result in the formation of disordered chimeric structures. For example, a single point mutation is sufficient to create a novel protein-protein interaction, which can result in chimeric assembly of proteins Garcia-Seisdedos et al. (2017). Notwithstanding these challenges, protein complexes typically assemble from their constituents accurately and carry out cellular functions with remarkable speed and precision Johansson et al. (2012).

Elucidating the characteristics of protein complexes that enable them to assemble reliably, and studying how these characteristics affect the organization of the proteome, can be viewed as fundamental goals of cell biology. Recently, there have been significant advances towards achieving these goals due to the progress in experiments Mulder et al. (2010); Gavin et al. (2006); Garcia-Seisdedos et al. (2017), bioinformatics Levy et al. (2006); Ahnert et al. (2015) and molecular dynamics simulations Perilla et al. (2015). However, a general theoretical framework to understand protein complex formation and usage is still lacking. One major difficulty in developing such a framework is the large diversity of cellular protein complexes. Some complexes, such as microtubules, exhibit unbounded growth Howard (2001). Others, such as ribosomes, have a well-defined finite size Ban et al. (2000). To complicate matters further, the latter complexes can be further divided among those that exhibit strong symmetries, such as the bacterial flagellar motor Berg (2003), and those that are fully asymmetric, such as ribosomes Ban et al. (2000). Whereas the principles of assembly of many symmetric complexes have been studied Ahnert et al. (2015), the same is not true for asymmetric complexes.

The aim of this article is to begin to develop a theoretical framework, which could ultimately be applied to (asymmetric) protein complexes, by extending a recent model of self-assembly Murugan et al. (2015). As it will be emphasized in the Discussion section below, cellular assembly is typically a highly controlled, non-equilibrium kinetic process. Still we will constrain our present theoretical study to equilibrium thermal physics alone and explore what constraints thermodynamics imposes on assembly of complexes. Interestingly, we shall see that these constraints alone can –at least partly– explain the observed heterogeneity of asymmetric complexes and their sparse usage of the proteome. We will also analyze existing structural, compositional, and interaction data of protein complexes to further evaluate some biological implications of our theoretical findings.

Results

Multifarious mixtures of components exhibit four assembly regimes

In our model, detailed in Materials & Methods, protein-like components form a multicomponent mixture. Two components forming part of a same complex interact with binding energy $E$ when they are in close proximity. Conversely, we assume that components not forming part of a same complex have a null binding energy. Such components still can interact non-specifically, provided their concentration, $p$ , is large. This model has been formulated and studied previously in Ref. Murugan et al. (2015). We extend its analysis to allow for variable heterogeneity and sparsity.

Just like changing the temperature and pressure of a gas can turn it into a liquid, changing the chemical potential $\mu=\log(p)$ and the binding energy $E$ of the component mixture can fundamentally alter its properties (here energy is expressed in units of $k_{\rm B}T$ , with $T$ the temperature and $k_{\rm B}$ Boltzmann’s constant). As shown in Fig. 2 (which describes a lattice implementation of our model, see also Supplementary Information A, SI A), for low negative values of the chemical potential the mixture is in a dilute solution (DS) regime, in which components interact only transiently with each other. This is the regime in which biochemical reactions have been traditionally studied Hill (1989). If, on the other hand, the chemical potential is high but the binding energy is low, the mixture increases its density and behaves as a liquid (L). The properties of this liquid are somewhat similar to those of multi-component droplets that have recently gained prominence in cell biology Brangwynne et al. (2009); Sear and Cuesta (2003). Finally, if both the chemical potential and the binding energy are high, the liquid mixture changes into a “chimeric” (Ch) regime, in which fragments of several protein complexes bind unruly to each other. This regime can evoke cellular inclusion bodies, where over-expressed recombinant proteins form disordered solid aggregates Kopito (2000). These three regimes are conceptually close to phases of inert materials, and will not be discussed here further (see, however, SI A–D).

We shall rather focus our attention on a more biologically interesting regime, which arises when the values of binding energy and chemical potential are comparable. In this “multifarious assembly” (MA) regime a large protein complex can self-assemble accurately, e.g. starting (nucleating) from a small part of such complex (nucleation seed), see Fig. 2 Murugan et al. (2015). One fundamental factor that determines the boundaries of the MA regime is the binding energy, $E$ , used to discriminate between specific and non-specific interactions. We expect the range of concentrations $p_{\rm max}/p_{\rm min}$ , in which reliable assembly is possible, to scale exponentially with $E$ (i.e., chemical potentials scale linearly with the energy). For the model at hand, we find indeed that (see SI B):

[TABLE]

see also Figs. 2E and F. Eq. 1 implies that a binding energy of a few $k_{\rm B}T$ is sufficient to ensure reliable assembly in a range of concentrations spanning several orders of magnitude. The parameter $\mu_{0}<0$ depends logarithmically on characteristics of the mixture: the number of component types, or “species”, $N_{\rm tot}$ , and the number of different complexes, $K$ . It also depends on characteristics of a typical complex $c$ : the total number of components it contains (i.e. its size), $M_{c}$ , the number of different species among these components, $N_{c}$ , and the number of specific bonds per protein, $z_{c}$ . For the sake of simplicity, we will limit ourselves in the following to the case $z_{c}=4$ , i.e. to a square lattice. However, our results can be generalized to other values of $z_{c}$ , see SI C.

Heterogeneity and sparsity constrain reliable assembly

The accurate assembly of complexes is a daunting discriminatory task. Proteins accomplish this task because their interactions and the composition of complexes they form are the result of “constrained evolution”. That is, the characteristics of complexes have evolved to ensure their cellular function, while at the same time they have been constrained to assemble reliably. As argued above, an important quantity, which is closely related to the reliability of the assembly process, is protein promiscuity: if the promiscuity of a protein were exceedingly high, undesired protein species could interfere with this protein during the assembly process, which would then result in the formation of “chimeric structures”. Therefore, we expect that evolution tuned protein promiscuity so that proteins do not form such non-functional chimeras.

In our model the promiscuity of a protein-like component is related to the number of different complexes, $K$ , and to their characteristics, such as their compositional heterogeneity, $h_{c}\equiv N_{c}/M_{c}$ ; and their sparsity, $a_{c}\equiv(N_{\rm tot}-N_{c})/N_{\rm tot}$ , i.e. the fraction of their proteome usage. One can show that the promiscuity of a component species $\alpha$ scales as:

[TABLE]

(see SI C for a detailed derivation). One can then express the evolutionary constraint of reliable self-assembly by relating the probability of the formation of chimeric structures to component promiscuity, and requiring that this probability is negligible. By doing so, we find that the number of possible coexisting complexes, their heterogeneity, their size, and their sparsity obey an important constraint relation:

[TABLE]

where the values of the exponents are given for $z_{c}=4$ , see SI C for generalization. Eq. 3 provides the scaling for the surface of transition between the MA regime, in which chimeric structures are avoided, and the Ch regime, in which they readily form (depicted in Fig. 3A). Although this relation holds in the limit $M_{c}\to\infty$ , our analysis of Monte-Carlo simulations is compatible with Eq. 3 (however, to unequivocally determine the scaling would require a much larger range of complex sizes, see SI D).

The key determinants of the transition from the MA regime to the Ch regime, predicted by the constraint of Eq. 3, might be biologically relevant. This constraint can be understood as a trade-off between the number of coexisting complexes, their heterogeneity, and their sparsity: more complexes can coexist if they are more heterogeneous, and if they make a sparser usage of the proteome. To see that this is indeed the case, let us first consider a single complex type ( $K=1$ ) prepared in a mixture that contains only its constituent components, so that $N_{\rm tot}=N_{c}$ and $a_{c}=0$ . Eq. 3 constrains the heterogeneity of the complex to be larger than a quantity that scales as a power of its size, $h_{c}\gtrsim M^{-1/3}_{c}$ . Therefore, by increasing the heterogeneity of the complexes, one crosses a transition from the chimeric regime to the desired multifarious assembly regime, as shown in Figs. 3B and C. This mechanism might thus explain the observed high heterogeneity among cellular protein complexes as the means of avoiding assembly of incorrect chimeric structures.

Next, let us consider the possibility of combinatorial usage of components in different complexes. In the case of a dense usage of the set of components, i.e. $a_{c}=0$ , the reliable assembly constraint, Eq. 3, implies that the number of possible coexisting complexes is $K\lesssim h_{c}N_{\rm tot}^{1/2}$ . Such increase of the number of complexes with increasing number $N_{\rm tot}$ of component species implies that combinatorial usage of components in complexes is indeed possible Murugan et al. (2015). However, the reliability constraint makes the combinatorial aspect only sub-linear, and therefore weak: an increase by a factor of one hundred in the number of component species merely increases by ten the number of possible complexes. Therefore, from a biological perspective, reliable assembly introduces a constraint that vastly reduces the possibilities of combinatorial expansion from proteins into protein complexes.

In order for many complexes to coexist, an alternative to this weak combinatorial usage is needed. This alternative is achieved by letting complexes make a sparse usage of the set of components, i.e. $a_{c}\lesssim 1$ . To see this, note that the number of possible coexisting complexes, $K$ , diverges in Eq. 3 as the component usage becomes more and more sparse, $a_{c}\to 1$ , see also Fig. 3D. An important consequence of such behavior is that the number of coexisting complexes scales super-linearly with the number of component species when a sparse usage is allowed,

[TABLE]

From a biological perspective, due to the previously evoked sparsity of the proteome usage, an increase by a factor of one hundred in the number of protein species results thus in an increase by a factor of a thousand in the number of complexes. Therefore, a sparse usage of the proteome may indeed help to insure the observed coexistence of many different types of protein complexes within the cell.

Structural and compositional data points towards high heterogeneity of protein complexes

How relevant are the theoretical arguments presented above for cellular protein complexes? It is clear that the proteome usage is generically sparse: even a complex with as many different protein species as the ribosome, for which $N_{c}\lesssim 10^{2}$ , contains only a small fraction of the proteome, $N_{\rm tot}\gtrsim 10^{3}$ . This gives a lower bound for the sparsity of complexes: $a_{c}>0.9$ . However, the second issue of whether highly heterogeneous complexes might coexist more easily, and thus would be more abundant in the cell, could be considered a subject of debate.

To address this question, we analyzed a publicly available database of protein complexes Meldal et al. (2018). The resulting histogram of the heterogeneity of these complexes, separated by their size, (see SI H for details), is depicted in Fig. 4A. The histogram reveals a large abundance of high-heterogeneity complexes, $h_{c}\in[0.8,1.0]$ , which supports the arguments presented in this work. From Fig. 4A it is also apparent that a significant number of complexes have intermediate values of heterogeneity, $h_{c}\in[0.4,0.6)$ .

Given that our theory applies to structures without any geometrical symmetry (as described in Materials & Methods), we can ask whether geometrical symmetry underlies the low heterogeneity of these complexes. For example, a complex with precisely four copies of the same protein, $h_{c}=1/4$ , may be reliably assembled if the proteins are wedged so that they lock in a symmetric ring-like structure, see Fig. 4B. The constraints that limit assembly reliability in this type of complexes are likely to be different from those for asymmetric complexes, see Fig. 4C. To estimate the role of symmetry in the heterogeneity of complexes, we classified them into those, for which the crystal structure exhibits symmetry, and those, for which it does not Berman et al. (2000). In Fig. 4D we plot the corresponding heterogeneity histograms, which clearly show that asymmetric complexes have a very large heterogeneity bias, whereas symmetric complexes exhibit a large peak for intermediate heterogeneity values. We thus corroborate that high heterogeneity is indeed widespread among asymmetric complexes, to which we have limited our model. These conclusions should be taken with a grain of salt, though, since generally speaking the databases of protein complexes are in their infancy, and are prone to many possible methodological biases and ambiguities (see also discussion in SI G).

Discussion

Additional mechanisms to prevent chimeric assembly

Within the cell, assembly of complexes takes place in a dense mixture of proteins. At the basis of successful assembly lays the discrimination of the particular proteins of a complex among many others present in the mixture. Here, we argued that thermal physics puts strong constraints on the characteristics of complexes so that they assemble reliably. In particular, to keep protein promiscuity low, the heterogeneity and sparsity of complexes is constrained to high values. Clearly, within the cell there are additional mechanisms that may help avoid chimeric assembly, some of which we describe now.

First, cellular protein concentrations can be spatially and temporally controlled to increase assembly precision and yield. In particular, cellular liquid droplets can provide local environments of high concentration of certain proteins, which may increase the assembly yield of corresponding complexes. A well known example of cellular “compartmentalization” is the assembly of ribosomes inside the nucleolus, where ribosomal proteins are synthesized Sirri et al. (2008). Similarly, the temporal aspects of production and transport of proteins can be regulated so to facilitate and optimize complex assembly Kalir et al. (2001). Second, it is highly plausible that the energy landscape for protein-protein interactions itself might have evolved so to facilitate the kinetics of protein complex assembly. This is similar to the arguments given for other classical discrimination phenomena in biology, such as the discrimination of correct initiation DNA sites by transcription factors Tafvizi et al. (2011); Cencini and Pigolotti (2017). A third possible mechanism to prevent formation of chimeric structures is the usage of non-pairwise protein interactions, e.g. allostery. For example, it was suggested in Ref. Antebi et al. (2017) that different tetrameric receptor complexes of the BMP pathway assemble upon binding of particular ligands. Furthermore, because only seven different species of proteins are involved in forming these receptors, the usage made of these proteins is dense, rather than sparse, which suggests that allostery provides a means to make a dense usage of proteins and enable combinatorial expansion. Finally, the geometry of complexes itself might have also evolved to optimize assembly. The presence of a peak at intermediate values of the heterogeneity in Fig. 4A, which can be ascribed to the symmetry of complexes, strongly suggests that the heterogeneity constraint that we derived may be avoided by means of geometric constraints to the structure of the complex, in line with the findings in Refs Ahnert et al. (2015); Garcia-Seisdedos et al. (2017).

Two organizational principles of cellular protein assembly explored in our theoretical model, namely the high heterogeneity of protein complexes, and the sparse usage they make of the proteome, should be already functional at thermal equilibrium. So should be also the four additional mechanisms described above; therefore, they could be incorporated and studied within future extensions of our model. Going beyond equilibrium considerations, it is important to stress that the accuracy and the speed in protein assembly are very likely to be enhanced by a number of out-of-equilibrium mechanisms Hopfield (1974); pigolotti2016protocols; Bisker and England (2018). For example, more than two hundred non-ribosomal proteins are involved in ribosomal biogenesis Kressler et al. (2010). Many of these are energy-consuming enzymes and have a variety of roles, e.g. stabilizing protein-RNA interactions. At the same time, in vitro studies have shown that it is possible to assemble ribosomal complexes in absence of such enzymes Mizushima and Nomura (1970); Röhl and Nierhaus (1982); Mulder et al. (2010), albeit with a smaller yield. This evokes a possible analogy with the process of protein folding, which is typically facilitated and sped up by energy consuming chaperones, but also can take place in their absence Hingorani and Gierasch (2014). It is tempting to propose that just as evolution had selected foldable proteins from the vast space of possible amino acid chain sequences, it might have also selected reliably self-assembling cellular complexes from the vast space of all possible multi-protein assemblies, see Fig. 1.

Broad distribution of protein participation in complexes and dynamic control

An additional role of out-of-equilibrium processes is to control the dynamics of protein complexes. Indeed, many complexes are assembled in a contextual manner, i.e. only when they carry out a function that is needed. Such highly dynamic complexes are often involved in regulatory and signaling functions Stein et al. (2009); Antebi et al. (2017), and they can be contrasted with other more stable complexes (such as the ribosome), on which we have focused our attention here. The complexes including cyclins, which participate in the regulation of the cell cycle, provide an example of such highly dynamic complexes Murray (2004). The temporal sequence of assembly and disassembly events observed in this system is correlated with phosphorylation / dephosphorylation of the cell cycle components. Remarkably, the heat release in this out-of-equilibrium process has also been recently measured Rodenfels et al. (2019).

A possible footprint of regulated out-of-equilibrium phenomena is the existence of proteins that participate in many complexes, such as cyclin dependent kinases, which have potential to act as dynamic controllers. In order to assess the prevalence of such proteins, we analyzed several databases that contained information of protein participation in complexes (see SI I). Our analysis suggests that the number of complexes, $q_{\alpha}$ , in which a given protein species $\alpha$ participates has a broad distribution, depicted in Fig. 5. Notably, this distribution cannot be simply explained by randomly grouping proteins into sets with the observed composition of complexes, unlike in other systems with shared components Mazzolini et al. (2018). The “excess” of highly participatory proteins can be viewed as an indication that this broad distribution might have evolved to carry out a particular function. One such function could be to assure an appropriate dynamic control of different assemblies (provided of course that this broad distribution is confirmed by future analysis of larger databases). A fascinating topic for future research should be, therefore, to extend our theoretical framework so to allow dynamic, out-of-equilibrium control of complexes, and to verify whether the latter could indeed correlate with the observed excess of highly participatory proteins.

Protein complexes as a distinct regime of matter

Reliable self-assembly of protein complexes within the noisy cellular environment is intriguing not only from the point of view of cell biology, but also from that of material science. Typical materials contain only a handful of different sorts of atoms, and the possible set of interactions between components is small. This results in states of matter that are easily reproducible: all crystals of salt are formed by Na and Cl atoms arranged in the same way. A different scenario is that of glassy materials, such as silicate glasses, in which the set of effective interactions among constituents is large due to spatial disorder. The large number of interactions makes the state of a glass unique and irreproducible: in each piece of glass the arrangement of atoms is different.

Protein complexes combine similarities with both types of materials. Like glasses, they present a large number of effective interactions, although its origin is not disorder, but the large number of different specifically interacting components. As in crystals, the arrangements of these components may be highly reproducible: all ribosomes in a cell are made of many different proteins, yet the arrangement of the core ribosomal proteins is basically the same. This unusual combination of properties is rarely considered in physical theories of matter, with the closest analogue being “programmable materials” of biological origin, such as DNA origami Jones et al. (2015) or self-assembling colloidal particles Zeravcic et al. (2017). One of the directions of future studies could be to further explore underlying principles of reliable equilibrium and non-equilibrium assembly for synthetic materials inspired by biology.

Materials and Methods

Model

Consider a set of $N_{\rm tot}$ component species labeled $\alpha=1,\ldots,N_{\rm tot}$ , which form the “proteome” of our theory. Components can assemble into $K$ different complexes labeled $c=1,\ldots,K$ . How two components of different species, $\alpha$ and $\beta$ , interact when they are in proximity is characterized by the binding energy matrix $U_{\alpha\beta}$ . If these two components are part of a same complex $c$ in which they are bound to each other, they will interact strongly with an energy $E$ , and so $U_{\alpha\beta}=-E$ . Conversely, if the components $\alpha$ and $\beta$ are not bound together in any complex, their interaction energy will be assumed null: $U_{\alpha\beta}=0$ . However, even when the interaction energy between two components is null, these may still bind to each other through non-specific interactions, provided that their concentration, $p$ , viz. their chemical potential $\mu=\log(p)$ , is large enough. Note that we have made the simplifying assumption that the strength of all interactions and the concentrations of all components are the same; this assumption is relaxed in SI F. One important quantity that characterizes the interactions of a component species $\alpha$ is its promiscuity, $\pi_{\alpha}=\sum_{\beta}\Theta(-U_{\alpha\beta})$ , which is the total number of different species with which it has specific interactions. We also define the participation of the species, $q_{\alpha}$ , as the number of different complexes in which it takes part.

Although the structural or enzymatic characteristics of complexes largely define their function, here we are only interested in the characteristics that determine self-assembly. We quantify these characteristics through a small set of parameters. For each complex $c$ , we define its heterogeneity, $h_{c}\equiv N_{c}/M_{c}$ , as the ratio of the number of different component species in the complex, $N_{c}$ , to the total number of components in the complex or complex size, $M_{c}$ (we have made the assumption that all complexes have the same number of components, relaxed in SI G). We also define the sparse usage that a complex makes of the available components, i.e. its sparsity, as $a_{c}\equiv(N_{\rm tot}-N_{c})/N_{\rm tot}$ . Finally, we absorb all geometrical properties of a complex into its mean coordination number, $z_{c}$ , which counts the average number of specific bindings per component in the structure. The latter is an important simplification, which may particularly affect certain complexes, for which the assembly is strongly tied to their geometry Ahnert et al. (2015). Note that the model in Murugan et al. (2015) corresponds to the case $h_{c}=1$ and $a_{c}=0$ (in addition to fixing the ratio $\mu/E$ ). Here, we relax these constraints, which allows us to explore the biologically relevant regime of protein complex assembly.

Figure parameters

In Fig. 2 we considered $N_{c}=M_{c}=N_{\rm tot}=20^{2}$ and $K=8$ . The initial state corresponds to a fragment of the complex containing its central components (a nucleation seed). The fragment is of size $7\times 7$ , in A and B, and $20\times 20$ (whole complex) in all other panels. In E and F each point reports the average of $3$ replicate simulations (randomized complexes). Each simulation is run for a duration of $10^{6}\,{\rm latt}$ in a $40\times 40$ lattice, with $1\,{\rm latt}$ corresponding to one lattice sweep. In Fig. 3 B and C the parameters are as in Fig. 2, with $E=7$ , $\mu=-12.6$ , and $K=1$ . In D, the parameters are as in B and C, with $h_{c}=1$ and variable $K$ .

Theoretical analysis

.1 Computational model

To support our analytical results, we performed grand canonical Monte Carlo simulations of a lattice implementation of our model, see Fig. 6 for a summary of notation. We begin by describing how complexes were generated, see also Fig. 7. We considered each of the $c=1,\ldots,K$ different complexes to be a square arrangement of square components representing proteins with lateral size $\sqrt{M}$ , so that $M_{c}=M$ (all complexes have the same size). Each complex contains $N$ different component species, $N_{c}=N\leq M$ (all complexes have the same number of species). The total number of component species is $N_{\rm tot}\geq N$ , and for each complex its $N$ different component species were randomly selected among the $N_{\rm tot}$ . This ensured a sparsity of $a_{c}=a=1-N/N_{\rm tot}$ , and allowed complexes to share some components with each other. Once the composition of each complex was decided, the spatial arrangements of the components in each complex was determined in the following way. For each complex, each of the corresponding $N$ selected component species was placed in $M/N$ random locations on a square of side $\sqrt{M}$ . The completely filled square constituted a complex, see Fig. 7. For the case in which $M/N$ was not integer, the remainder locations on the square were randomly assigned to different component species, such that $N\times\lfloor M/N\rfloor$ species are repeated $\lfloor M/N\rfloor$ times in each complex, and $M-N\times\lfloor M/N\rfloor$ species are repeated $\lfloor M/N\rfloor+1$ times. The result is a heterogeneity $h_{c}=h\approx N/M$ for all complexes. This procedure allowed us to generate a controlled number of potentially coexisting complexes with given heterogeneity, sparsity, and size.

Following the main text, we defined the interaction energy among component species based on whether two species interact within any of the $K$ different complexes. Note that each complex defines an adjacency bond matrix $B_{\alpha\beta\delta}^{c}$ , where $\alpha$ and $\beta$ are the species that participate in the complex and $\delta=1,\ldots,z$ is the valence, with $z=4$ being the coordination number of the chosen square lattice. The elements of the bond matrix are as follows: if two components of species $\alpha$ and $\beta$ are neighbors with valence $\delta$ in complex $c$ , then $B_{\alpha\beta\delta}^{c}=1$ , otherwise the matrix element is null. We used the bond matrices of complexes to define the interaction energy matrix in the following way:

[TABLE]

where $\Theta$ is the Heaviside function, see also Murugan et al. (2015). In addition to the binding energy matrix, each species is characterized by its chemical potential, $\mu_{\alpha}$ .

Using $U_{\alpha\beta\delta}$ and $\mu_{\alpha}$ , we could define the energy of the system, $\mathcal{U}$ , crucial in Monte Carlo simulations. To do so, we characterized the state of the system by an occupancy matrix $\sigma_{\alpha i}$ , which takes the value $1$ if the $i-$ th lattice site is occupied by species $\alpha$ , and zero otherwise. For notational convenience we used the label $\alpha=0$ to denote empty sites, and set $\mu_{0}=0$ and $U_{0\beta\delta}=U_{\alpha 0\delta}=0$ . With these definitions we could write the energy of a lattice configuration $\sigma_{\alpha i}$ as

[TABLE]

where the first sum spans pairs of near neighbors, and the second spans the whole lattice. Note that, besides for $\alpha=0$ , the chemical potential of all species is homogeneous throughout the lattice and among all species, and so $\mu_{\alpha}=\mu$ for $\alpha>0$ . Note also that our sign convention is such that $E>0$ , i.e. $U_{\alpha\beta\delta}<0$ , and $\mu<0$ , so that larger and positive $E$ results in stronger binding, whereas lower negative $\mu$ corresponds to lower concentrations (this convention is different from that in Murugan et al. (2015)).

The Monte Carlo algorithm used in the simulations was as follows:

A lattice site $i$ and a species $\alpha$ (including $\alpha=0$ ) were randomly selected. 2. 2.

The change in energy, $\Delta\mathcal{U}$ , associated with changing the species currently in $i$ by $\alpha$ was calculated using Eq. 6. 3. 3.

The Glauber transition probability was calculated, $W=(1+\exp(\Delta\mathcal{U}))^{-1}$ . 4. 4.

A random number $p\in[0,1]$ was generated. 5. 5.

The species change was performed or not, depending on whether $p<W$ or not.

Steps $1-5$ constituted one Monte Carlo step. Typical simulations were run for $\sim 10^{6}\,{\rm latt}$ , where $1\,{\rm latt}$ is a lattice sweep and corresponds to $L$ Monte Carlo steps, with $L=\sqrt{L}\times\sqrt{L}$ the size of the square lattice. Unless otherwise specified, we used $\sqrt{L}=40$ .

In order to quantify our simulations, we used the density, the energy, and the assembly error as order parameters. The density is defined as the fraction of lattice sites that are occupied (i.e., they contain species $\alpha>0$ ). The energy of a lattice configuration, $U$ (measured in units of $-E$ ), is the total number of bonds among near-neighbor lattice sites for which $U_{\alpha\beta\delta}=-E$ . Our definition of assembly error is the same as in Murugan et al. (2015). First, the number of sites, $S$ , that match between the target complex and the largest connected cluster in the lattice is computed. Second, the number of sites, $Q$ , in the union of the largest connected cluster and the initial seed is also computed. Third, the error is defined as one minus the ratio of these quantities, ${\rm Err}=1-S/Q$ .

It is important to note that in the chimeric (Ch) regime this definition of the error does not result in zero error, but rather the error converges to a saturation value ${\rm Err}_{\rm sat}$ . This is because in Ch the initial seed remains unchanged and it nucleates chimeric structures that fills the whole lattice. The value of the saturation error thus depends on the initial seed size and on the lateral lattice size $\sqrt{L}$ . In particular, if the initial seed has lateral size $\gamma\sqrt{M}$ , with $\gamma<1$ , then once the whole lattice has been filled by chimeric structures, we have $S=\gamma^{2}M$ , whereas $Q=L$ . Therefore, the saturation error is given by

[TABLE]

In Fig. 2E the initial seed is the whole structure, $\gamma=1$ , while the lateral lattice size is $\sqrt{L}=2\sqrt{M}$ . This gives ${\rm Err}_{\rm sat}=3/4$ , which agrees with the value of the error in Ch, seen in Fig. 2E. In contrast, both in the liquid regime (L) and in the dilute solution regime (DS) the initial seed dissolves, and the error is of order unity, as can also be seen in Fig. 2E. These two latter regimes can however be distinguished by their density, as L is dense and DS is dilute, see Fig. 2F. The multifarious assembly regime (MA), on the other hand, can be identified by an error value that approaches asymptotically zero, and mean density of $1/4$ for the chosen simulation geometry.

.2 Boundaries between regimes in $(E,\mu)$ space

The four different regimes of a multifarious mixture that can be observed in Fig. 2 are separated by curves that are roughly linear. We now characterize these curves. For simplicity, we focus on the case $M=N=N_{\rm tot}$ relevant for Fig. 2. In this case the usage of the available component species is “dense”, and complexes are fully heterogeneous. In the coming sections we will further discuss the roles of heterogeneity and sparsity.

DS to MA transition. The low density DS regime is separated from all other regimes by the straight line

[TABLE]

where the slope $1/2$ is connected to the geometry chosen in our model, a square lattice. To see this, consider the fully assembled complex as the initial state. The components that are bound more weakly are those at the corners, as each one of them only has two neighbors (unlike three for surface components and four for interior components). These components will be removed (on average) if $2E<-\mu$ . Once these are removed, the subsequent surface components will also have two bonds, and they will also be removed. This “peeling-off” will continue until the whole complex disassembles, which explains the boundary between DS and L (this is an example of textbook arguments for lattice gas models).

MA to Ch transition. To explain the onset of Ch, we analyzed simulations, and noted that the event that most likely resulted in chimeric structures was the one depicted in Fig. 8A. Consider two neighboring components, of species $\gamma_{1}$ and $\gamma_{2}$ , at the boundary of an assembled complex, depicted in brown in Fig. 8A. Besides participating in the brown complex, these species may also participate in other complexes, in which they will have different neighboring component species. A component, $\alpha$ , which does not belong to the brown complex, can thus form a bond with $\gamma_{1}$ if these two component species, $\alpha$ and $\gamma_{1}$ , are neighbors in a complex different from the brown one, a blue complex in Fig. 8A. Similarly, another component $\beta$ can form a bond with $\gamma_{2}$ if $\beta$ and $\gamma_{2}$ are neighbors in a third complex, depicted in red in Fig. 8A. In principle, $\alpha$ and $\beta$ only have one bond each with the brown complex. However, if they happen to be neighbors in a fourth complex, depicted in green in Fig. 8A, then each of them is stabilized by two bonds. In this case, since $2E>-\mu$ , they will form a stable nucleation seed for that fourth complex, which will result in a chimeric structure. In Fig. 8B we show an example of this process observed in a simulation. We can estimate the rate of such event, $W_{\rm chim}$ , by multiplying the transition rates for binding $\alpha$ and for binding $\beta$ by the corresponding entropic factors:

[TABLE]

The first term in parenthesis corresponds to the binding of $\alpha$ , and includes two entropic factors and one energetic factor. The first entropic factor $4\sqrt{M}$ arises from the length of the boundary of the existing structure, which we took to be the whole complex. The second corresponds to the chances for component $\alpha$ to be the partner of a given surface component of $c_{a}$ . Because there are $K$ choices for $c_{b}$ and a total of $M$ different species, this contributes a factor $K/M$ . The second term in parenthesis includes two entropic factors: one factor of two, because $\beta$ can be bound to both sides of $\alpha$ , and a factor of $K^{2}/M$ , which corresponds to the same factor as before, times $K$ chances that the second bond also matches for the given species $\beta$ . The energetic factors correspond to $\alpha$ establishing only one bond, so that $\Delta\mathcal{U}=E+\mu$ , and $\beta$ establishing two bonds, so that $\Delta\mathcal{U}=2E+\mu$ .

The transition line between Ch and MA is characterized by high chances of chimeric nucleation, that is $W_{\rm chim}T_{\rm nucl}\approx 1$ with $T_{\rm nucl}$ the time needed to nucleate a large chimeric structure. This condition results in a non-linear dependence of $E$ on $\mu$ . For large energies (i.e. $E\gg 1$ , but $E$ and $\mu$ comparable), Eq. 9 simplifies to $W_{\rm chim}\approx q(K,M)\exp(E+\mu)/(1+\exp(E+\mu))$ , where $q(K,M)=8K^{3}M^{-3/2}$ is the multiplicity factor. The transition line then becomes

[TABLE]

with $\mu_{0}=-\log(q(K,M)T_{\rm nucl}-1)$ . Putting this result together with the transition line between MA and DS, Eq. 8, we see that the range of chemical potentials for MA is given by $\mu_{\rm max}-\mu_{\rm min}=-E+\mu_{0}+2E=E+\mu_{0}$ , which is indeed equivalent to Eq. 1 in the main text. To generate the transition line depicted in Fig. 2, we estimated $\mu_{0}$ by analyzing the Ch-MA transition for $\mu=-30$ , and fitting the sigmoidal $g(E)={\rm Err}_{\rm sat}+(1-{\rm Err}_{\rm sat})/(1+\exp(k(E+\mu_{0})))$ . This resulted in $\mu_{0}=-10.27\pm 0.02$ , from which we obtain a value $T_{\rm nucl}\approx 6\,\cdotp 10^{4}$ .

In Fig. 9 we show an example of the dynamics of assembly in the chimeric regime. As one can see, the appearance of large assembly errors due to chimeric growth roughly takes place at the estimated nucleation time $T_{\rm nucl}$ . This further supports our analysis. In particular, in Fig. 9C we quantify the dynamics of the system energy, $U$ . As stated above, $U$ is the number of bonds between components that are specific, i.e. that correspond to the assembling complex. For an isolated complex, the energy is $U_{\rm comp}=-4E(\sqrt{M}\times\sqrt{M}-\sqrt{M})/2$ , where $4$ is the number of matching bonds for each component, and $2$ ensures that we are not counting bonds twice. This value matches the initial state in Fig. 9C. As the system evolves, more components are added, and the system decreases its (negative) energy as it becomes more stable. The maximum number of specific bonds that can be generated is $U_{\rm min}=-2E\sqrt{L}\times\sqrt{L}$ , with $L$ the system size. In Fig. 9C we can indeed see that the negative energy of the system stays always below $U_{\rm min}$ . Eventually, the system “freezes” into a chimeric state and the energy saturates to an intermediate value. We estimate the energy of the chimeric state as the energy of a complex plus the energy of the surrounding material: $U_{\rm chim}\approx U_{\rm comp}-E(\sqrt{L}\times\sqrt{L}-\sqrt{M}\times\sqrt{M})f/2$ . Here, the factor $f$ corresponds to the average number of bonds in the material surrounding the complex. For the case in which all bonds match corresponding to MA, $f=4$ . At the other extreme, when growth occurs through filamentous structures, we have $f=2$ . In the case presented in the figures we used the intermediate value $f=3$ .

Ch to L transition. This transition occurs at low values of $E$ and shows no dependence on $\mu$ , see Fig. 2E and F. To understand why this is the case, note that because both regimes are dense, components are replaced by other components without any role for empty sites, therefore the transition rate is independent of $\mu$ . We determined the transition energy $E_{0}$ by fitting on the $\mu=0$ line the sigmoidal function $g(E)={\rm Err}_{\rm sat}/(1+\exp(k(E-E_{0})))$ . This resulted in $E_{0}=4.0\pm 0.1$ . Although we do not extensively characterize the boundary between Ch and L, in Fig. 10A we show how the onset of L is preceded by a decrease in the energy of the system. This is further quantified in Fig. 10B, e.g. for the $\mu=0$ curve. The reason for the observed non-monotonicity is that at low binding energies the state of the system is not fully frozen, and components can rearrange over time to form large fragments of complexes separated by disordered boundaries. Examples of such behavior for different values of the energy through the transition are depicted in Fig. 10C.

.3 Scaling laws

In the previous section we discussed the boundaries of the different regimes of our model in the thermodynamic space $(E,\mu)$ . The position of these boundaries depend on the parameters that characterize the complexes, $(h,a,K,M)$ , through algebraic relations. Such dependencies result in the scaling laws of the main text, which we now derive.

Estimating promiscuity. We begin by providing an estimate for the promiscuity of components. First, we note that a given component can bind $z$ other components simultaneously, and it thus has $z$ different binding sites. We now study the promiscuity per binding site of a given species $\alpha$ . Because on average the promiscuity of all components is the same, hereafter we note $\pi_{\alpha}=\pi$ . We first discuss the case of a mixture that only consists of components from a single complex ( $K=1$ and $N=N_{\rm tot}$ ). Importantly, the dependence of the promiscuity on the heterogeneity is non-monotonic. For very low heterogeneities, $N\ll M$ , the promiscuity is roughly equal to the number of species, $\pi\approx N=hM$ , i.e. all species interact with all species. This is because the probability that none of the $M/N$ members of one species interacts with another given species in the structure is $P_{\rm no\;int}\approx(1-1/N)^{M/N}$ , which for $N\ll M$ is neglible. However, as $N$ increases and becomes comparable to $M$ , the chances of two species to not interact becomes of order one. We can estimate the value of the heterogeneity at the cross-over between these two cases, $h_{0}$ . This is done by expanding asymptotically in the system size, $\epsilon\equiv M^{-1}$ ,

[TABLE]

We see now that for

[TABLE]

the second term becomes small. In this case, the probability that there are no interactions becomes of order one, and our previous estimate breaks down.

We can obtain a different estimate of the promiscuity for large values of the heterogeneity. If the complexes are fully heterogeneous, then each species will interact just with another one, and so the promiscuity is $\pi=1$ . Reducing the heterogeneity increases the number of members of each species present in the complex. If each of these species interacts with a different species, we have $\pi\approx M/N=h^{-1}$ . It is this second estimate of promiscuity that we used in the main text. The justification for the choice of this second estimate is as follows. The scaling of the cross-over heterogeneity is $h_{0}\sim M^{-1/2}$ , while the scaling of the heterogeneity value that determines the MA to Ch transition is $h\sim M^{-1/3}$ (see Eq. 15 below). Therefore, for the values of $h$ in which the second estimate of the promiscuity breaks down and the first estimate should be used instead, the system is already in the chimeric regime. In Fig. 11A and B we show how randomly reshuffling $M$ components of $N$ different types indeed results in an average promiscuity that behaves as described.

We now generalize the second estimate of promiscuity for the case in which there are several complexes and the usage of the component species is sparse, i.e. $K>1$ and $a<1$ . In this case, for randomly sampled complexes, the promiscuity will be decreased by the probability that a given component species is selected, $N/N_{\rm tot}=1-a$ , and increased by the number of total complexes, $K$ . We thus have

[TABLE]

which is used in the main text. We stress that this estimate holds for randomly generated complexes of size $M$ with heterogeneity larger than $h_{0}\approx M^{-1/2}$ .

MA to Ch scaling. We can use the estimate of the promiscuity in Eq. 13 to obtain an estimate of the maximum number of coexisting complexes that can be reliably assembled. To do so, consider a square seed of a complex on a square lattice, i.e. $z=4$ (we will later generalize to arbitrary $z$ ). The components added to the edge of a growing layer (for a square complex, the edge is a line of components) have two near neighbors. Each of the two components in the growing tip of the complex defines a set of $\pi$ species, with which the component to be added can have specific interactions. For small $\pi$ , the probability that these two sets intersect, $P_{\cap}$ , is small: the only intersection corresponds to the correct species for the given complex. Therefore, for low promiscuity assembly errors are scarce. On the other hand, as $\pi$ increases the probability of these two sets to intersect can become of order unity. The seed starts then to grow into a chimeric structure. We shall now estimate $P_{\cap}$ .

We have $P_{\cap}=1-P_{\neg\cap}$ , and for random structures, $P_{\neg\cap}$ can be estimated as the probability that each of the $\pi$ components $\alpha$ included in the first set are not in the second, and so $P_{\neg\cap}=\prod_{\alpha}P_{\neg\cap_{\alpha}}=P_{\neg\cap_{\alpha}}^{\pi}$ . Estimating the probability that a component $\alpha$ is in the second set as $P_{\cap_{\alpha}}\approx{\pi}/{N_{\rm tot}}$ , we arrive at

[TABLE]

where in the second equality we have used the definitions of heterogeneity and sparsity, and –as before– we have taken $\epsilon=M^{-1}$ . To calculate the scaling for the boundary separating the Ch and MA regimes, we set $P_{\cap}\lesssim 1$ . This results in

[TABLE]

Setting in this equation $a=0$ and $K=1$ , we recover $h\gtrsim M^{-1/3}$ , while setting only $a=0$ we obtain $K\lesssim hN_{\rm tot}^{1/2}$ . Generalization to an arbitrary coordination number $z$ requires to note that the surface components have in general $z/2$ neighbors. Our previous estimate is then modified by considering the intersection of $z/2$ sets, and so

[TABLE]

which through an expansion in $\epsilon$ directly results in

[TABLE]

We can reconcile these scaling relations with the entropic factors found for the rate for chimeric structure formation, Eq. 9, in the following way. Consider that a point in $(E,\mu)$ within the MA regime is chosen. If we start increasing the number of complexes, the entropic factor $q(K,M)$ of $W_{\rm chim}$ in Eq. 9 will increase algebraically, shifting the offset $E_{0}$ of the boundary with the Ch regime, see Fig. 2. Eventually, the point chosen in $(E,\mu)$ will fall in Ch. Note that Eq. 9 predicts a particular scaling for this shift in the boundary offset, which is precisely $K\sim\sqrt{M}$ , analogous to Eq. 15 in the limit of $N=M=N_{\rm tot}$ .

Throughout the main text we used $z=4$ . The reason is that for this value we could carry out numerical simulations on a square lattice that validated our analytical results. We now discuss the validity of our arguments for other values of $z$ .

•

First, we note that the heterogeneity scaling, which was $h\gtrsim M^{-1/3}$ for $z=4$ , is in general given by $h\gtrsim M^{(2-z)/(2z-2)}$ . The size scaling exponent is therefore negative for $z>1$ (it diverges for $z=1$ ), and is always larger than $-1/2$ . Because $h_{0}$ has a size scaling exponent of $-1/2$ , see Eq. 12, the latter ensures that the estimate for the promiscuity that was used, Eq. 13, is adequate all throughout the MA regime. The fact that, for $z>1$ , the lower bound for $h$ has a negative size scaling exponent allows us to conclude that in all cases there is a lower bound for heterogeneity that decreases with system size. Therefore, our argument about the role of high heterogeneity in assembly should remain unaffected for $z>1$ . Note that the only complexes for which $z=1$ are dimeric complexes, which we do not expect to obey our scaling predictions due to their small size.

•

The combinatorial expansion for a dense usage of the proteome, $a=0$ , is in general bounded by $K\gtrsim hN_{\rm tot}^{(z-2)/z}$ . This results in a sub-linear combinatorial expansion for $z>2$ . Therefore, the argument of weak combinatorial usage is valid for $z>2$ . The case $z=2$ , which corresponds to filamentous complexes, in which each component has two neighbors, does not allow for any combinatorial expansion, as already noted in Murugan et al. (2015).

•

Finally, allowing for a sparse usage of components, we obtain $K\gtrsim N_{\rm tot}^{(2z-2)/z}M^{-1}$ , and the super-linear scaling that results from sparsity is preserved for all values $z>2$ . Again, the case $z=2$ is special, as it results in a linear scaling with the proteome size.

These arguments show that the three main conclusions of our work should remain valid for $z>2$ : ( $i$ ) higher heterogeneity reduces chances of chimeric structure formation, ( $ii$ ) combinatorial expansion is weak for dense proteome usage, ( $iii$ ) sparsity allows for super-linear combinatorial expansion. The case $z=2$ , which corresponds to filaments, has to be discussed separately. Within our framework, for $z=2$ only fully heterogeneous complexes (filaments) can be assembled. To see this, consider a filamentous complex, in which one species $\alpha$ appears twice, while the rest appear only once. If $\alpha$ is bound to the growing tip of the complex, there are two possible partners that can bind to it with equal strength, which would result in incorrect assemblies with probability of one half. For the same reason, components can not be shared. Therefore, for $z=2$ a dense usage of the proteome only allows for a single complex. A sparse usage clearly allows for a number of complexes that, in order not to share components, must increase linearly with the proteome size. Therefore, in this case sparsity is also (rather trivially) necessary.

Ch to L scaling. The scaling of the transition from Ch to L can be estimated with a similar argument as that of the transition from MA to Ch. We consider a full complex surrounded by a disordered chimeric structure, and our goal is to determine under what conditions the disordered aggregate will make the complex vanishg. At the interface with the complex, only some of the components in the aggregate (those that nucleated the chimeric structures, such as $\alpha$ and $\beta$ in Fig. 12) have specific interactions with the complex. The majority of the interface components have specific interactions with the aggregate, but not with the complex. Therefore, the majority of the components at the boundary of the complex, such as $\alpha$ in Fig. 12, are stabilized by three bonds (two with other surface components of the complex, $\gamma_{1}$ and $\gamma_{2}$ , and one with an interior component of the complex, $\gamma_{3}$ ), and have frustrated interactions with the interface components, $\gamma_{4}$ . If the promiscuity is very large, then a component $\beta$ , non-specific to the complex, may satisfy the compatibility condition established by three of the four neighboring components, and thus replace $\alpha$ at no energetic cost. If the three neighbors to which $\beta$ binds are the same as $\alpha$ , the process stops there. But if one of the neighbors corresponds to the disordered aggregate, $\gamma_{4}$ , then one of the components of the complex will loose the bond, see Fig. 12. This component will then become be less stable, and more susceptible of being replaced by an alternative component that does not belong to the complex. This process will eventually result in the disordered aggregate “dissolving” the complex, as seen in Fig. 2C. To estimate the scaling that corresponds to this process we note that $\beta$ establishes three specific bonds, and so it is at the intersection of three sets of size $\pi$ . The probability that these sets intersect, $P_{\cap}$ , is now given by

[TABLE]

which by setting $P_{\cap}\approx 1$ directly gives

[TABLE]

for the case $z=4$ . Generalization of this expression to arbitrary $z$ requires noting that an an outer surface of a $z/2$ dimensional complex has $z-1$ neighbors conditioning each of its components. Therefore, it is the intersection of $z-1$ sets that has to be considered, which gives $K^{z-1}\lesssim h^{2z-3}M^{z-2}/(1-a)^{2z-3}$ .

.4 Finite size scaling

The transition from MA into Ch is characterized by Eq. 15, and that from Ch into L by Eq. 19. These equations predict particular size scaling relations of $a$ , $h$ , and $K$ with $M$ . Denoting with asterisk subscripts the values at the boundaries, we define the following exponents:

[TABLE]

With these definitions, Eq. 15 implies $\nu=1/2$ , $\mu=1/3$ , $\gamma=1/3$ , while Eq. 19 implies $\bar{\nu}=2/3$ , $\bar{\mu}=2/5$ , $\bar{\gamma}=2/5$ . To numerically confirm these analytical predictions, we performed three classes of numerical simulations, which we describe below. In order not to make any assumption about the form of the scaling function, we used the procedure developed in Houdayer and Hartmann (2004). In short, consider $s_{\rm tot}$ datasets $D_{1},\ldots,D_{s_{\rm tot}}$ , each corresponding to a different system size (here complex size, $M$ ) and each composed by a set of points $i=1,\ldots,t_{s}$ . For each of the points there exists a value of the scaling parameter, $x_{i}$ (for example, the heterogeneity $h$ ); a value of the function, $g_{i}$ , whose scaling we want to study (e.g., the assembly error); and a standard deviation for the value of the function, $\delta_{i}$ (here calculated from different replicate simulations). That is, we have $s_{\rm tot}$ datasets $D_{s}=\{x_{i},g_{i},\delta_{i}\}_{1}^{t_{s}}$ , which in our case are generated through numerical simulations. Provided these datasets, the procedure in Houdayer and Hartmann (2004) provides a quality function, $S(\mu)$ , of the scaling exponent, $\mu$ . This quality function measures the mean square distance of the rescaled values of the function in units of standard errors. The estimated value of the exponent is obtained by minimizing the quality function, with a good data rescaling corresponding to values below $S=2$ . The error in the exponent is obtained by calculating the value of the deviation in exponent for which the quality function increases by one.

Three classes of simulations were performed. In all of them we considered a $L=40\times 40$ lattice, and studied assembly sizes, $M=10^{2},11^{2},\ldots,32^{2}$ with the binding energy set to $E=7$ and the chemical potential to $\mu=12.6$ . Sixty replicate simulations were done for each parameter set, each with a duration of $10^{6}{\rm latt}$ . To determine each exponent, we used as the order parameter the error at the end of the simulation. We also used the saturation error in the Ch regime, given by Eq. 7, to rescale the error as explained below (only rescaled errors below $1.05$ were considered). We would like to emphasize that the range of assembly sizes that we considered in the simulations is insufficient to accurately determine the scaling of the regime boundaries. To accurately determine these scalings would require expanding the range of sizes to cover at least two orders of magnitude. The purpose of our simulations is simply to provide numerical support for our analytical arguments, i.e. for Eqs. 15 and 19. Importantly, cellular protein complexes have sizes of at most $M\sim 10^{2}$ , so the precise value of the exponents is unlikely to be biologically relevant.

In the first class of simulations we fixed the heterogeneity to $h=0.9$ and the sparsity to $a=0$ , i.e. $N=N_{\rm tot}$ . We then changed the number of coexisting complexes, $K$ , for various sizes of complexes, $M$ , in order to determine $\nu$ and $\bar{\nu}$ . It is convenient to introduce the loading variable $n\equiv K/M$ .

•

Determining $\nu$ . To calculate $\nu$ we used the scaling hypothesis ${\rm Err}/{\rm Err}_{\rm sat}=f(K/K_{*})=f(n/n_{*})=g(nM^{1-\nu})$ , and studied low values of the loading, $n\in[0,0.06]$ , so that the system would only cross the first regime boundary, MA to Ch. This is the transition studied in Murugan et al. (2015). In Fig. 13A we see that the error changes smoothly from near [math] to near its saturation value. Fig. 13B shows the fit quality function, which takes its minimum at $\nu_{\rm min}=0.42$ . By calculating the value for which the quality function increases by unity, we estimate $\delta\nu=0.07$ . This matches well the analytical prediction $\nu=1/2$ . In Fig. 13C we can see that with the numerically determined exponent there is a good data collapse.

•

Determining $\bar{\nu}$ . To calculate $\bar{\nu}$ we analyzed the same class of simulations as above, but now used the scaling hypothesis $(1-{\rm Err})/(1-{\rm Err}_{\rm sat})=f(K/K_{*})=f(n/n_{*})=g(nM^{1-\bar{\nu}})$ . We studied high values of the loading, $n\in[0.03,0.4]$ , so that the system would cross the second regime boundary, Ch to L. In Fig. 13D we see that the error changes smoothly from near its saturation value, at the end of the first transition, to near unity, in the liquid regime. Fig. 13E shows the quality function of the fits, which has a minimum at $\bar{\nu}_{\rm min}=0.69$ , with $\delta\bar{\nu}=0.07$ . This matches well the analytical prediction $\bar{\nu}=2/3$ . In Fig. 13F we can see that there is a good data collapse.

In the second class of simulations we fixed the number of complexes to $K=5$ and the sparsity to $a=0$ , i.e. $N=N_{\rm tot}$ . We then changed the heterogeneity, $h$ , for several complex sizes, $M$ , in order to determine $\mu$ and $\bar{\mu}$ .

•

Determining $\mu$ . We considered relatively high values of the heterogeneity, $h\in[0.4,1.0]$ , and used as scaling hypothesis ${\rm Err}/{\rm Err}_{\rm sat}=f(h/{h}_{*})=g(hM^{\mu})$ . Fig. 14A shows the data before the collapse, and Fig. 14B shows the quality function, which has a minimum at $\mu_{\rm min}=0.28$ with $\delta\mu=0.06$ . This compares favorably with the theoretical prediction, $\mu=1/3$ . In Fig. 14C we can see that for the numerically determined exponent there is a good collapse of the data. This collapse is better than that shown in Fig. 3B and C, which was done for the limiting case $K=1$ . One reason for this is that in Fig. 3C the collapse was done with the analytically predicted exponent, not with the numerically determined one. Another contributing factor is that for $K=1$ the promiscuity is smaller, which can blur the transition.

•

Determining $\bar{\mu}$ . We now considered the same class of simulations as in the previous paragraph, but in the range $h\in[0.0,1.0]$ and use the scaling hypothesis $(1-{\rm Err})/(1-{\rm Err}_{\rm sat})=f(h/{h}_{*})=g(hM^{\bar{\mu}})$ . The calculated exponent was $\bar{\mu}_{\rm min}=0.38$ , with $\delta\bar{\mu}=0.05$ . This compares well with the theoretical prediction, $\bar{\mu}=2/5$ . As before, Figs. 14D to F show the raw data, the fit quality function, and the data collapse, respectively.

In the third class of simulations we fixed the number of complexes to $K=100$ and the heterogeneity to $h=1$ , i.e. $M=N$ . We then changed the sparsity in order to determine $\gamma$ and $\bar{\gamma}$ . It was convenient to perform the analysis using $1-a$ and not $a$ , as this is the quantity for which the scaling exponents were predicted.

•

Determining $\gamma$ . We considered relatively high values of the sparsity, so that $1-a\in[0.0,0.3]$ , which captures the transition from Ch to MA, see Fig. 3A. We used as scaling hypothesis ${\rm Err}/{\rm Err}_{\rm sat}=f((1-a)/(1-{a}_{*}))=g((1-a)M^{-\gamma})$ . Fig. 15A shows the data before the collapse, and Fig. 15B shows the quality function, which has a minimum at $\gamma_{\rm min}=0.29$ with $\delta\gamma=0.03$ . This compares favorably with the theoretical prediction, $\gamma=1/3$ . In Fig. 15C we can see that for the numerically determined exponent there is a good collapse of the data

•

Determining $\bar{\gamma}$ . We now considered the same class of simulations as above, but in the range $1-a\in[0.1,1.0]$ and using the different scaling hypothesis $(1-{\rm Err})/(1-{\rm Err}_{\rm sat})=f((1-a)/(1-{a}_{*}))=g((1-a)M^{-\bar{\gamma}})$ . The calculated exponent was $\bar{\gamma}=0.42$ , with $\delta\bar{\gamma}=0.06$ . This compares well with the theoretical prediction, $\bar{\gamma}=2/5$ . As before, Figs. 15D to F show the raw data, the fit quality function, and the data collapse, respectively.

.5 Boundaries between regimes in $(h,K)$ and $(a,h)$ space

Fig. 3A shows a portrait in $(a,h,K)$ space of the different possible regimes for our model using Eqs. 15 and 19. In section .4 we verified the size scaling of $(a,h,K)$ predicted by these expressions, which is an indirect confirmation of the functional form of the boundaries that separate the different regimes. A more direct confirmation can be done by comparing numerically determined two dimensional cross-sections of this three dimensional parameter space. These cross-sections can then be compared to the predicted theoretical boundaries that separate the different regimes.

In Fig. 16 we show two such (orthogonal) sections, $a=0$ in A , and $K=8$ in B. A third section, $h=1$ , appears in Fig. 3D. All three cross-sections show boundaries that match well the scaling behaviors predicted by Eqs. 15 and 19, and schematically depicted in Fig. 3A. Note that in order to determine the boundaries that separate the different regimes we needed to determine the exact pre-factors for the scaling relations. In Fig. 3D this was done by fitting sigmoidal functions on the $a=0$ line in the same way as described before for the boundaries in $(E,\mu)$ space. The values of $K$ obtained for the transition from MA into Ch and from Ch into L were $9.2\pm 0.1$ and $32.6\pm 0.3$ , respectively. These same values were then used to characterize the transition boundaries in Fig. 16A. In the case of Fig. 16B, besides the aforementioned values, we also used sigmoidal fits to characterize the offset of the linear boundaries. We found that, for a sparsity equal to $0.5$ the transitions from L to Ch and from Ch to MA occur at $h=0.19$ and $h=0.38$ , respectively.

.6 Variability in $E$ and $\mu$

In the main text of this paper we have taken all specific interactions between components to have the same strength, $E$ , and the chemical potentials of all species also to be same, $\mu$ . This is clearly a simplifying assumption, as in the cellular environment we expect a large variability in these parameters. To explore the role of such a variability, we sampled the chemical potential of each species, $\mu_{\alpha}$ , and the energy of interaction between two component species that are neighbors in a given complex, $U_{\alpha\beta}$ , from two normal distributions, $\mathcal{N}(\bar{\mu},\sigma_{\mu}^{2})$ and $\mathcal{N}(\bar{E},\sigma_{E}^{2})$ , respectively. We first fixed all the parameters to values such that the system finds itself in the MA regime, and systematically varied the values of $\sigma_{\mu}$ and $\sigma_{E}$ .

As one can see in Fig. 17, for very low values of the variability, the system is in the MA regime, as expected. Reliable assembly remains robust for values of the energy standard deviation up to $\sigma_{E}\approx 0.1\bar{E}$ , and values of the chemical potential standard deviation up to $\sigma_{\mu}\approx 0.1\bar{\mu}$ . Beyond this, the variability induces a transition into the chimeric regime, but not into the dilute solution regime, which also borders MA, see Fig. 2. Note that in panels C and D in Fig. 17 the transition into Ch occurs for larger values of the variability, so that the robustness of MA is increased. The reason for this is that in C and D the mean value of the energy (see caption) is larger, i.e. one is “deeper” in the MA regime, see Fig. 2E. Therefore, the same relative variation as in panels A and B, Fig. 17, will now result in the values of energy for which MA is stable. We leave open the systematic study of these variability induced transitions, as well as their possible role for the cellular environment.

.7 Variability in $M_{c}$

In the main text of the paper we have considered the case in which all complexes have the same size, $M=M_{c}$ . We now discuss the effect of size variability in the MA regime, i.e. its effect on the reliability of assembly. Importantly, to isolate the role of size variability in MA we have to constrain to fixed values the intensive properties of complexes, such as $h_{c}$ and $a_{c}$ . Keeping this in mind, we expect that size variability has a minor effect on MA. The reason is that the constraint of Eq. 15 on MA arises from the promiscuity of components, Eq. 13, and in terms of promiscuity it makes no difference (up to boundary effects) whether components are distributed in one large complex, or in several small ones. Therefore, as long as the average complex size is large enough to disregard boundary effects, we expect our results to be robust. To see this quantitatively, consider the heterogeneity of complexes fixed at $h_{c}=1$ , without any loss of generality. Following .3, the promiscuity of components can be then estimated as $\pi\approx\sum M_{c}/N_{\rm tot}=K\langle M_{c}\rangle/N_{\rm tot}$ , which suggests that our conclusions are not affected by the size variability.

In order to confirm this conclusion, we performed numerical simulations, see Fig. 18. In particular, we sampled the size of complexes, $M_{c}$ , from a normal distribution with mean $\bar{M}$ and different values of the variance $\sigma^{2}_{M}$ . We constrained the heterogeneity to $h_{c}=1$ by keeping $N_{c}=M_{c}$ , and truncated the distribution to values $M_{c}<N_{\rm tot}$ . Truncating the normal distribution affects the average size of complexes for large values of the variance: while for $\sigma_{M}\ll\bar{M}$ we have $\langle M_{c}\rangle=\bar{M}$ , for $\sigma_{M}\gg\bar{M}$ we have $\langle M_{c}\rangle=N_{\rm tot}/2$ . This increase in average complex size for large values of the variance will affect promiscuity, and therefore alter the MA regime. Therefore, we expect that for $\sigma_{M}\lesssim\bar{M}$ the MA regime remains robust, as can indeed be seen in Fig. 18. Conversely, for $\sigma_{M}\gtrsim M$ we expect that the MA regime is shrunk, as can be also observed in Fig. 18. Following .3 it is straightforward to see that $K\lesssim\langle M_{c}\rangle^{-1}$ , which corresponds to the red line in Fig. 18 ( $\langle M_{c}\rangle$ was here empirically determined for each value of $\sigma_{M}$ ). This confirms that the shrinkage of the MA regime is purely due to a change in the average size of complexes. Therefore, our results are robust to size variability, and they should remain valid provided that one simply replaces $M$ by $\langle M_{c}\rangle$ .

Database analysis

To quantify the characteristics of complexes we analyzed publicly available databases. Broadly speaking, at the present time there are three classes of such databases for protein complexes: manually curated databases, in which the inclusion of new protein complexes is done by curators according to some pre-established criteria Meldal et al. (2014); Pu et al. (2008); databases generated by clustering of protein interaction networks Nepusz et al. (2012), themselves generated through some pre-established experimental protocol, such as TAP-MS Gavin et al. (2006); Krogan et al. (2006); Wan et al. (2015); and databases obtained through meta-analysis of several high-throughput datasets Hart et al. (2007); Drew et al. (2017). Although protein complex databases have been generated for more than a decade, they still show important weaknesses. On the one hand, manually curated databases are not comprehensive, the inclusion criteria suffer from some arbitrariness, and their growth is biased by the interests of the scientific community. On the other hand, clustering of protein interaction networks depends on the clustering algorithm, and the interaction networks themselves are known to commonly include false-positive and false-negative interactions. As far as possible, we were able to verify the conclusions of our model using several databases. However, the intrinsic limitations of protein complex databases should be kept in mind, and the conclusions derived from these databases should be taken with all necessary precautions.

.8 Heterogeneity

In order to quantify the heterogeneity of protein complexes, we used a database with stoichiometric information. To our knowledge, the most comprehensive databases with this type of information are those in the manually curated Complex Portal Meldal et al. (2014, 2018). In particular, the database of Saccharomyces cerevisiae, which was used to generate Fig. 4, is the most complete, containing over 500 complexes. Unfortunately, not all these complexes contain full stoichiometric information. In particular, for $301$ complexes stoichiometric information is completely absent, for $50$ it is only partial (i.e. it is known only for some of its components), however for $238$ complexes in this dataset there exists full stoichiometric information. In order not to create systematic biases in the dataset, we constrained ourselves to this last set of complexes with full stoichiometric information. Ultimately, we identified $K=238$ complexes with complete stoichiometry formed by $N_{\rm tot}=746$ components (we excluded small molecules, annotated using the “ChEBI” label, such as Mg2+), which conforms dataset $I$ . The characteristics of this and other datasets are summarized in Table 1.

In Fig. 19A we plot histograms for the complex heterogeneity using a bin size half of that used in Fig. 4A. As one can see, there is a large abundance of high-heterogeneity complexes, $h_{c}>0.9$ , which confirms the observation of Fig. 4. The distribution is dominated by fully heterogeneous complexes, $h_{c}=1$ , of which there are $149$ . Low heterogeneity complexes, $h_{c}<0.2$ , are rare. An example of a low heterogeneity complex is the fatty acid synthase, a dodecamer with only two species (CPX-1162, PDB: 2UV8). Importantly, there is a persistent presence of complexes with intermediate heterogeneities, $h_{c}\in[0.5,0.6)$ . These are dominated by complexes with heterogeneity $h_{c}=1/2$ , of which there are $43$ . For example, the Gal3-Gal80 transcription regulation complex, with two copies of each of its two components (CPX-1042, PDB: 3V2U).

To determine whether these mid-heterogeneity complexes have a relationship with geometrical symmetries in complexes, we cross-referenced dataset $I$ with the protein data bank Berman et al. (2000). We repaired some inconsistencies within the database, such as outdated PDB entries or entries referering to molecular dynamics simulations (these changes were discussed with Complex Portal curators, and should be included in future releases). The result was dataset $II$ . For each complex in this dataset we looked at the symmetry group provided by the PDB website, which is calculated using alignment of repeated proteins within the complex, as described in https://www.rcsb.org/pages/help/advancedsearch/pointGroup. We then separated the complexes in two sets, those that are asymmetric and those that exhibits a certain symmetry. In Fig. 19B we plot a histogram of the heterogeneity of asymmetric and symmetric complexes for a smaller bin size than in Fig. 4D. As one can see, asymmetric complexes are much more heterogeneous than symmetric ones. In particular, the large peak in $h_{c}\in[0.5,0.6)$ only occurs for symmetric complexes.

.9 Protein participation

We defined the participation of a protein species $\alpha$ , $q_{\alpha}$ , as the number of complexes in which $\alpha$ takes part. Importantly, to determine this quantity the only information about complexes needed was their composition (no stoichiometric or structural information is relevant here). We thus defined dataset $III$ , larger than $I$ or $II$ , as the raw Complex Portal dataset (excluding, again, small molecules). For this dataset $III$ we quantified $q_{\alpha}$ through a histogram, Fig. 20A (blue), which included protein species that participate in seven or more complexes (there are $29$ such proteins). An example of the latter is Cdc28 (P00546), the yeast cyclin dependent kinase, which participates in nine complexes. We also quantified the number of species per complex, $N_{c}$ , which spanned about two orders of magnitude. Given that both quantities spanned a wide range of values, we asked whether a simple null model could explain the histogram of $q_{\alpha}$ using the histogram of $N_{c}$ as input. To answer this question, we generated $K$ artificial complexes by randomly sampling the $N_{\rm tot}$ different protein species into sets with $N_{c}$ different species, where $N_{c}$ was taken from the original dataset. In this way, the resulting “random” dataset was guaranteed to generate the same histogram of $N_{c}$ . However, as shown in Fig. 20A (green), the histogram of $q_{\alpha}$ for the random dataset deviated from that of dataset $III$ , not recovering the high participation of some proteins.

If confirmed, this is a somewhat surprising conclusion in view of the recent observation in Ref. Mazzolini et al. (2018) implying that for several complex systems the distribution of $q_{\alpha}$ (therein called “occurence”, up to normalization) could be explained by a random sampling of components into sets of sizes given by the empirically observed compositions (therein including stoichiometry). Our procedure is analogous to that of Ref. Mazzolini et al. (2018), the only difference being that we disregarded stoichiometry. Our initial expectation was that the random sampling approach would recover the distribution of $q_{\alpha}$ . However, the disparity depicted in Fig. 20A suggests that the participation of proteins cannot be explained by randomly sampling the composition of complexes.

We sought to confirm this conclusion by analyzing alternative datasets. Once the need for stoichiometric information is lifted, there are several available databases of protein complexes. We used two additional datasets for S. cerevisiae. To remove the possibility that our conclusions are an artifact of manual curation, we used the high-throughput data from Krogan et al. (2006) to generate dataset $IV$ . In particular, we applied the recent ClusterOne clustering algorithm from Nepusz et al. (2012) to the data from Krogan et al. (2006) in order to generate dataset $IV$ . ClusterOne has been shown to outperform other standard clustering algorithms (note also that the commonly used MCL algorithm Enright et al. (2002) does not allow for complexes to share components). In particular, we used the weighted version of the algorithm, setting the probability of interaction as weight (the parameters of the algorithm were the size and threshold of the detected complexes, taken to be $2$ and $0.3$ , respectively). To generate dataset $V$ we used the meta-data from Hart et al. (2007), which combines the raw data from Krogan et al. (2006) and Gavin et al. (2006) and assigns a $p-$ value for the interactions. Again, we clustered the data using the ClusterOne algorithm with weights given by $1-p$ (parameters as before). The histograms of $q_{\alpha}$ for datasets $IV$ and $V$ are shown in panels B and C of Fig. 20. Panel B exhibits a very similar trend to panel A, with highly participatory proteins that cannot be explained by our simple null model. Panel C, however, is compatible with the random model.

To further support the observation of a broad distribution of $q_{\alpha}$ that cannot be explained by the random model, we sought larger datasets, with more protein species and more complexes. Datasets $VI$ , $VII$ , and $VIII$ include metazoan data, and are all significantly larger than the yeast datasets. Dataset $VI$ is the manually curated Corum 3.0 dataset Giurgiu et al. (2018), which incorporates data of humans ( $67\%$ ), mouse ( $15\%$ ), and rat ( $10\%$ ). Specifically, it corresponds to the “core” dataset, in which cross-species redundancies are removed (this was the dataset used in Fig. 5). Dataset $VII$ corresponds to the high-throughput experiments carried out and analyzed in Wan et al. (2015). Here, the data was clustered into complexes using two steps: first, a machine learning classifier trained with part of the Corum dataset; second, the ClusterOne algorithm. Finally, dataset $VIII$ corresponds to the meta-analysis of several human high-throughput experiments Drew et al. (2017), with data from Wan et al. (2015) as well as Hein et al. (2015) and Huttlin et al. (2015). The data was again classified using a two-step procedure similar to the one used for $VII$ . The histograms for datasets $VI-VIII$ are shown in panels D, E and F of Fig. 20. In all these cases the distribution of the protein participation is broad, and deviates significantly from the “random model”. This provides further support to the hypothesis that the statistics of participation of proteins in complexes can not be derived from the statistics of protein complex composition. It remains an open question what are the causes for the observed prevalence of highly participatory proteins.

Bibliography50

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Ban et al. (2000) Nenad Ban, Poul Nissen, Jeffrey Hansen, Peter B Moore, and Thomas A Steitz, “The complete atomic structure of the large ribosomal subunit at 2.4a resolution,” Science 289 , 905–920 (2000).
2Garcia-Seisdedos et al. (2018) Hector Garcia-Seisdedos, José A Villegas, and Emmanuel D Levy, “Infinite assembly of folded proteins in evolution, disease, and engineering,” Angewandte Chemie International Edition (2018).
3Reuveni et al. (2017) Shlomi Reuveni, Måns Ehrenberg, and Johan Paulsson, “Ribosomes are optimized for autocatalytic production,” Nature 547 , 293 (2017).
4Stein et al. (2009) Amelie Stein, Roland A Pache, Pau Bernadó, Miquel Pons, and Patrick Aloy, “Dynamic interactions of proteins in complex networks: a more structured view,” The FEBS journal 276 , 5390–5405 (2009).
5Antebi et al. (2017) Yaron E Antebi, James M Linton, Heidi Klumpe, Bogdan Bintu, Mengsha Gong, Christina Su, Reed Mc Cardell, and Michael B Elowitz, “Combinatorial signal perception in the bmp pathway,” Cell 170 , 1184–1196 (2017).
6Murray (2004) Andrew W Murray, “Recycling the cell cycle: cyclins revisited,” Cell 116 , 221–234 (2004).
7Garcia-Seisdedos et al. (2017) Hector Garcia-Seisdedos, Charly Empereur-Mot, Nadav Elad, and Emmanuel D Levy, “Proteins evolve on the edge of supramolecular self-assembly,” Nature 548 , 244 (2017).
8Johansson et al. (2012) Magnus Johansson, Jingji Zhang, and Måns Ehrenberg, “Genetic code translation displays a linear trade-off between efficiency and accuracy of trna selection,” Proceedings of the National Academy of Sciences 109 , 131–136 (2012).

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

Taxonomy

Towards a theory of assembly of protein complexes: lessons

Abstract

Introduction

Results

Multifarious mixtures of components exhibit four assembly regimes

Heterogeneity and sparsity constrain reliable assembly

Structural and compositional data points towards high heterogeneity of protein complexes

Discussion

Additional mechanisms to prevent chimeric assembly

Broad distribution of protein participation in complexes and dynamic control

Protein complexes as a distinct regime of matter

Materials and Methods

Model

Figure parameters

Theoretical analysis

.1 Computational model

.2 Boundaries between regimes in (E,μ)(E,\mu)(E,μ) space

.3 Scaling laws

.4 Finite size scaling

.5 Boundaries between regimes in (h,K)(h,K)(h,K) and (a,h)(a,h)(a,h) space

.6 Variability in EEE and μ\muμ

.7 Variability in McM_{c}Mc​

Database analysis

.8 Heterogeneity

.9 Protein participation

.2 Boundaries between regimes in $(E,\mu)$ space

.5 Boundaries between regimes in $(h,K)$ and $(a,h)$ space

.6 Variability in $E$ and $\mu$

.7 Variability in $M_{c}$