Distribution of phenotype sizes in sequence-to-structure genotype-phenotype maps
Susanna Manrubia, Jose A. Cuesta

TL;DR
This paper analytically derives the distribution of phenotype sizes in sequence-to-structure genotype-phenotype maps, revealing how different features influence the size distribution and implications for evolvability.
Contribution
It introduces models that interpolate between powerlaw and lognormal distributions, providing insights into the factors shaping phenotype size distributions.
Findings
Distribution of phenotype sizes varies between powerlaw and lognormal.
Features of the sequence-to-structure map determine the size distribution.
Models help understand evolvability and navigability of genotype space.
Abstract
An essential quantity to ensure evolvability of populations is the navigability of the genotype space. Navigability relies on the existence of sufficiently large genotype networks, that is ensembles of sequences with the same phenotype that guarantee an efficient random drift through sequence space. The number of sequences compatible with a given structure (e.g. the number of RNA sequences folding into a particular secondary structure, or the number of DNA sequences coding for the same protein structure) is astronomically large in all functional molecules investigated. However, an exhaustive experimental or computational study of all RNA folds or all protein structures becomes impossible even for moderately long sequences. Here, we analytically derive the distribution of phenotype sizes for a hierarchy of models which successively incorporate features of increasingly realistic…
| Symbol | Definition |
|---|---|
| Sequence or genotype length | |
| Alphabet size | |
| Versatility of site | |
| Number of sites in the low-versatility class | |
| Number of sites in the high-versatility class | |
| Size of a phenotype | |
| Number of phenotypes with the same size | |
| Set of -genotypes | |
| Rank of a phenotype | |
| Probability density that a phenotype has size | |
| Number of phenotypes from different ordering of sites |
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Distribution of genotype network sizes in sequence-to-structure
genotype-phenotype maps
Susanna Manrubia
Grupo Interdisciplinar de Sistemas Complejos (GISC), Madrid
Dept. de Biología de Sistemas, Centro Nacional de Biotecnología (CSIC), Madrid, Spain
José A. Cuesta
Grupo Interdisciplinar de Sistemas Complejos (GISC), Madrid
Dept. de Matemáticas, Universidad Carlos III de Madrid, Leganés, Madrid, Spain
Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)
Universidad de Zaragoza, Spain
UC3M-BS Institute of Financial Big Data (IFiBiD), Universidad Carlos III de Madrid, Getafe, Madrid, Spain
Abstract
An essential quantity to ensure evolvability of populations is the navigability of the genotype space. Navigability, understood as the ease with which alternative phenotypes are reached, relies on the existence of sufficiently large and mutually attainable genotype networks. The size of genotype networks (e.g. the number of RNA sequences folding into a particular secondary structure, or the number of DNA sequences coding for the same protein structure) is astronomically large in all functional molecules investigated: an exhaustive experimental or computational study of all RNA folds or all protein structures becomes impossible even for moderately long sequences. Here, we analytically derive the distribution of genotype network sizes for a hierarchy of models which successively incorporate features of increasingly realistic sequence-to-structure genotype-phenotype maps. The main feature of these models relies on the characterization of each phenotype through a prototypical sequence whose sites admit a variable fraction of letters of the alphabet. Our models interpolate between two limit distributions: a power-law distribution, when the ordering of sites in the prototypical sequence is strongly constrained, and a lognormal distribution, as suggested for RNA, when different orderings of the same set of sites yield different phenotypes. Our main result is the qualitative and quantitative identification of those features of sequence-to-structure maps that lead to different distributions of genotype network sizes.
Keywords: genotype-phenotype map, neutrality, RNA, phenotype size, evolution
1 Introduction
How genotypes map into phenotypes counts amongst the most essential questions to understand how evolutionary innovations might come about and how evolutionarily stable strategies are fixed in populations. With some of its features seemingly dependent on the system studied and on the description level considered, the genotype-phenotype (GP) map appears far from trivial. Many studies have addressed the effect of mutations on phenotype: point mutations [1, 2, 3], genome fragment deletion [4], duplication or inversions, or the knockout of specific genes [5] —among others— may or may not have an effect at the molecular, metabolic, regulatory, or organismal level [6]. Also, the ability of genotypes to yield more than one phenotype is a main resource of molecular adaptation [7, 8]. The probability of expressing different phenotypes or of experiencing mutations that modify the current phenotype depends on the structure of the GP map, which eventually determines how the space of function is explored, and what are the chances that a population survives or innovates in the face of endogenous or exogenous changes [9, 10, 11, 12, 13, 14].
Most models are restricted to the many-to-one realisation of the GP map, and thus assume that adaptation is dominated by mutations. There is a plethora of different model systems studied under this assumption. Despite seemingly relevant underlying molecular differences, those models present a remarkable number of common properties. Exhaustive research on the GP map was pioneered by studies of RNA sequence-to-secondary-structure mappings. Most topological properties identified in RNA spaces are shared by other simple systems, such as the existence of huge genotype networks, the increase in phenotype robustness with the size of the latter, and a very skewed distribution of network sizes. The set of genotypes that yield the same phenotype typically forms a network, since those genotypes are pairwise connected through mutations. Sufficiently large genotype networks so defined were postulated as a condition for the navigability of sequence space long ago [15]. Subsequent studies have shown that such large networks do exist, and that the difference in sequence between genotypes in those networks can be as large as the difference between two random sequences [16, 17, 18, 2]. Phenotype robustness refers to the average effect of point mutations in the genotypes of a specific genotype network. It has been shown to grow logarithmically with the size of the phenotype in RNA [19], in a self-assembly model of protein quaternary structure [20] and in simple models for protein folding [13]. The existence of qualitative and quantitative statistical properties of the GP maps shared by apparently dissimilar systems suggests that they might arise from basic universal features [21, 13]. Though genotype networks are not always fully connected, they do traverse the whole space of genotypes for sufficiently abundant phenotypes, thus ensuring high navigability [22, 23]. Even in cases where genotype networks are fragmented, those fragments could be mutually reached if the GP map is many-to-many. The existence of “promiscuous” sequences that map into more than one phenotype enhances navigability and promotes fast adaptation [7, 14].
The statistical property of GP maps that has attracted the most attention is very likely the distribution of genotype network sizes, or phenotype sizes for short. Due to the astronomically large sizes of genotype spaces, initial estimations of the size of phenotypes were performed through random samplings of genotype space. The results were often represented as frequency-rank plots, with phenotypes ordered according to their sizes. Random samplings of genotype spaces in many-to-one GP maps invariably yielded some very abundant phenotypes and a large number of phenotypes represented by a few or just one genotype [24, 25]. Often, a frequency-rank plot was fitted to a generalized Zipf’s law [26], implying a power-law-like distribution of phenotype sizes. However, subsequent studies demonstrated that the frequency-rank plot of phenotype sizes actually had a more complex functional shape [27, 28, 29, 30], and specific functional fits were avoided. Subsequent studies have exhaustively mapped the complete sequence space to its corresponding phenotypes, among which RNA sequence-to-minimum energy secondary structure map [28, 31], the hydrophobic-polar (HP) model for protein folding [29, 2], or toyLIFE, which includes a sequence-to-structure-to-function description [30]. As a result, complete phenotype size distributions (for short sequences) are now available. Fitted shapes range from power-law-like curves [32] to lognormal distributions [31].
It has been argued that, among other generic properties, a skewed distribution of phenotype sizes results from the organization of biological sequences into constrained and unconstrained parts. In [33], the authors introduce the Fibonacci GP map, a many-to-one artificial model, where sites in a sequence can be coding or non-coding, and either lead to new phenotypes under mutations (coding sites) or yield the same phenotype (neutral, non-coding sites). The model can be analytically solved and yields a power-law phenotype size distribution, in qualitative agreement with some observations.
In this contribution, we attempt an identification of the elements in the organization of sequences that characterize the quantitative properties of the distribution of phenotype sizes. We show in a constructive fashion that the model in [33] is an example of a broad spectrum of sequence-to-structure GP models. Starting with the simplest case, where sequences are separated into constrained and neutral parts, and adding subsequent elements in the organization of the sequences and versatility levels of the sites, we show how the distribution of phenotype sizes changes from pure power-law (with an exponent dependent on how genotypes are distributed among phenotypes) to lognormal. This functional form is independent of whether the GP map is many-to-many (sequences are promiscuous) or many-to-one (the phenotype can be uniquely predicted from the sequence). Our final example corresponds to the RNA sequence-to-secondary structure map, where we demonstrate that the combinatorial properties of the distribution of sites of variable neutrality along sequences causes the distribution of phenotypes to follow a lognormal distribution, with parameters that can be traced to properties of the genotype set. Our main result is that a lognormal distribution of phenotype sizes is the expected result in any GP map where sufficient variation in the number of phenotypes of similar size is present.
2 Definitions
We will study four models that interpolate between the simplest case of sequences divided into neutral and non-neutral sites separated into two groups and a general case (represented by RNA), and calculate for each of them the size of a phenotype given the sequence organization of its corresponding genotypes, the number of phenotypes with the same size, the frequency rank ordering of phenotypes, and eventually the distribution of phenotype sizes. Table 1 summarizes the nomenclature and definitions used in this work, and Figure 1 illustrates some relevant quantities.
The genotype space is made of sequences of length letters from an alphabet of size . Two examples of alphabet sizes are for a binary alphabet and for DNA or RNA, A, C, G, T or U. The versatility of site is defined as the average number of different letters of the alphabet that can occupy a given sequence position . In general for all sites . This is a quantity closely related to neutrality. We will study the simplified case where sites can take one out of two different values, , with . Sites are called constrained if , and neutral if . We will use to count the number of sites with low versatility.
The size of a phenotype is the number of different genotypes compatible with that phenotype. From the definition of it follows that is a nonincreasing function of . In the literature, phenotype frequency [33], number of sequences for a phenotype [32] or neutral set size [31] have been used with a meaning identical to phenotype size here. The set of -genotypes is defined as the number of genotypes compatible with -phenotypes, . The rank of the first phenotype in size class is . Note that the total number of phenotypes coincides with the maximum rank.
If is the probability density that a phenotype has size , then we can count phenotypes as
[TABLE]
To first order we can approximate the integral as (the approximation gets better the smaller ). Thus, up to a normalisation constant, p(S)\propto C\big{(}\ell(S)\big{)}\left|S^{\prime}\big{(}\ell(S)\big{)}\right|^{-1}.
The probability density yields the probability of finding a phenotype with size when uniformly sampling over phenotypes. This corresponds to the distribution , as defined in other studies [31].
Finally, we will also introduce a factor to represent the fraction of -genotypes that actually go to a given -phenotype. This factor arises from additional restrictions in the assignment of genotypes to phenotypes which are not made explicit in the models. In general, if the models we are going to introduce assign the same genotypes to several -phenotypes. This would correspond to a many-to-many GP map —a sort of maps suitable to describe molecular promiscuity. Incidentally, molecular promiscuity strongly enhances navigability in genotype space [7, 8, 14]. Other choices may account for specific restrictions in the models; in particular, a suitable choice of may render the GP map many-to-one. We will return to this point when we provide details of the models.
A succint definition of the hierarchy of models introduced in this work is as follows:
- •
Model 1: Constrained and neutral sites occupy fixed positions. Sequences are separated in two parts, the first one of length occupied by constrained sites, , and the second part of length occupied by neutral sites, . Two minor variants considered are (i) phenotypes are all viable and (ii) lethal mutations occur independently of the site class.
- •
Model 2: Constrained and neutral sites occupy variable positions. This is illustrated by means of two examples: (i) constrained sites are split into two fragments at the beginning and at the end of the sequence and (ii) constrained sites can occupy arbitrary positions in the sequence.
- •
Model 3: Versatile sites occupy fixed positions. Two different types of sites with fixed versatilities and are considered.
- •
Model 4: Versatile sites occupy variable positions: RNA. In a first approximation, RNA sequences contain two types of sites that occupy different positions in the sequence subject to secondary structure constraints: those forming pairs (stacks) in the secondary structure have average versatility , and those unpaired (loops) have average versatility . The model can be generalized to an arbitrary number of site classes.
Figure 2 schematically represents the different models analysed here and some properties that will be of relevance to understand the distributions of phenotype sizes they yield.
3 Results
3.1 Model 1: Constrained and neutral sites occupy fixed positions
This is probably the simplest non-trivial model in the class of GP maps, very similar in spirit to that presented in [33]. Phenotypes are characterized by constrained sites in the first part of the sequence. For a fixed , mutations in a constrained site change the phenotype, and mutations in neutral sites yield genotypes compatible with the phenotype. Therefore,
[TABLE]
Note that, if , the complete genotype space is partitioned among -phenotypes for every value of . This implies that, if we consider all possible phenotypes (i.e. all values), a particular genotype is simultaneously compatible with many different phenotypes —representing a highly promiscuous sequence. Specifically, if the total number of genotypes compatible with -phenotypes is , so the total amount of genotypes . This result clearly shows the many-to-many nature of the GP map of this model with this choice of —genotypes are assigned to all phenotypes they are compatible with and, therefore, are repeatedly counted.
A minimal rule to avoid multiple assignments is to think of as the probability that a genotype is actually assigned to an phenotype. When this probability is uniform, , and we choose , the total number of genotypes becomes , the size of the genotype space, so the resulting map is effectively many-to-one. Other examples in which depends on will appear later.
Now, to obtain size as a function of rank we must eliminate in and substite it into to get . In this case, from Eq. (3) and assuming ,
[TABLE]
and substituting in (1)
[TABLE]
To obtain the probability density we first notice that Eq. (1) implies , hence . On the other hand , thus
[TABLE]
Hence the probability distribution is a power-law with exponent .
3.1.1 Non-viable genotypes arise from uniformly distributed lethal
mutations
In the same scenario as above, let us assume that a fraction of mutations is lethal, thus leading to a non-viable genotype. In this case, Eqs. (1) to (3) are identical, with substituted by . Therefore, and are as above with the latter change. This result shows that the existence of a non-viable class to which viable genotypes can mutate does not necessarily imply relevant functional changes in the distribution of phenotypes, which is in either case of the form , with . The effect of uniformly distributed lethal mutations could be therefore absorbed as a constant into . The situation changes if mutations are not distributed uniformly, but their likelihood depends on . This would be a particular realisation of Model 3 introduced below.
3.2 Model 2: Constrained and neutral sites occupy variable positions
In any realistic model (e.g. the case of RNA) the position of constrained and neutral sites should matter in the definition of a phenotype. While does not change its functional form as a result, does (and as a consequence), causing potentially relevant modifications in and . In general, the number of different phenotypes would take the form , where accounts for changes in the letter of the constrained site (yielding a different phenotype, as assumed) and is a model-dependent combinatorial number that counts the different ways in which the sites can be arranged to yield meaningful (and different) phenotypes. In general, the factor in stems from mutations in neutral sites, while the arrangement of constrained and neutral sites along the sequence is weighted by Q\big{(}L,\ell(S)\big{)}, with effects on the functional form of that, in general, depend on the permitted arrangements. As will be shown, might enormously increase the number of phenotypes and, especially, the relative abundances of -phenotypes.
3.2.1 Constrained sites are split into two groups at the extremes of
the sequence
As a way of example, let us consider one of the simplest situations where the position of the constrained sites matters. Suppose that those sites can be split into two groups with lengths and and placed at the beginning and at the end of the sequence (such that and ). This gives different phenotypes with constrained sites, and
[TABLE]
From these expressions we can obtain (see Appendix A) the asymptotic (for large ) rank distribution
[TABLE]
and the size probability density
[TABLE]
with and some constants.
Therefore, even in this simple case with quite a limited number of possible organization of constrained sites, and are no longer pure power-laws, though the dominant term of the phenotype size distribution (size still dominated by mutations in neutral sites) is characterized by an exponent . The total number of genotypes compatible with -phenotypes is also modified, , and is seen to increase linearly with .
3.2.2 Constrained sites can occupy any position in the sequence
We now assume that the constrained and unconstrained sites can occupy any site of the chain. In that case
[TABLE]
with no simple expression for . Let us focus, however, on the size distribution , and consider the case where . Asymptotically for
[TABLE]
Changing to through
[TABLE]
and writing , we finally obtain
[TABLE]
a log-normal distribution with mean and variance , very different from the distribution of the previous cases.
This section presents an example of a main result of this study. It shows that, when the definition of the phenotype depends on the specific position of constrained and neutral sites in sequences, the functional form of (and, in consequence, of ) qualitatively changes. In particular, the exponential growth of with dominates , which takes the form of a lognormal distribution. Other quantities defining the GP map, such as or , change now the parameters of the distribution, but do not modify its shape.
3.3 Model 3: Versatile sites occupy fixed positions
The models analysed above demonstrate that when sites are either constrained or neutral, the exponent associated to the power-law part of is . As we show next, this exponent is modified when the sites in the sequence show intermediate degrees of versatility, which causes the number of -genotypes to depend on .
Let us consider the case where the sites are just less constrained than the sites, such that the former admit an average of different letters of the alphabet and the latter admit , with . Relevant functions read
[TABLE]
with .
As it can be readily seen by substitution, these expressions reduce to Model 1 for and . Now,
[TABLE]
yielding
[TABLE]
For large this scales as , where depends on and as
[TABLE]
yielding in the limit of Model 1. Substituting this expression into Eq. (17),
[TABLE]
hence, up to a constant factor,
[TABLE]
Again maintains its power-law shape but its exponent depends on and .
The number of -genotypes now becomes
[TABLE]
This number can either increase or decrease with depending on whether is larger or smaller than 1. Both situations are possible under the constraint . The values of and change in response to possible enrichements or depletions in the total number of assigned genotypes with . This is a first example of similar cases encountered later in this work and in the literature, as we discuss later.
3.4 Model 4: Versatile sites occupy variable positions: RNA
In a first approximation (which has been shown to yield acceptable fits to data [19]), RNA sequences can be divided into two classes of sites: those in stacks (bound) and those in loops (unbound), characterized by different degrees of neutrality (see e.g. [34] and Fig. 4 in [35]). Changes in the position of loops and stacks means a different phenotype. Additionally, the composition of each site in the sequence bears a significant correlation with the structural element it will preferentially represent in the phenotype (see Fig. 7 in [36]). Therefore, a first approximation to a GP representation of RNA involves elements in our previous Models 2 and 3. In the following, the abundances or phenotypes will be ruled by (averaged) values and of the number of letters that can be changed in stacks or loops, respectively (see Fig. 2), without affecting the phenotype.
Studies of RNA neutral networks and their related properties are usually restricted to the many-to-one mapping between sequence and structure. Despite the fact that any RNA sequence is compatible with multiple structures whose relative weight in an ensemble of identical sequences is defined by their folding energy [37], it is common practice to select only the minimum energy fold as the associated phenotype. This decision transforms an intrinsic many-to-many GP map where alternative phenotypes can be reached through mutations or promiscuity, into a many-to-one map where navigability is limited to the effects of neutral drift. Analytical approaches cannot include, in general, energetic considerations, so they implicitly work in the many-to-many unrestricted case. This situation is comparable to the assignation of sequences to structures we have performed in our models, where every sequence is assigned to all phenotypes it is compatible with, while possible restrictions in the assignments are encompassed in . The distribution of secondary structure sizes for the unrestricted map (i.e. all sequences compatible with a given secondary structure) fixing the number of stacks or loops has been derived in [38] for the general case of structures with pseudoknots, in [39] and [40], and in [41] in a form that will be used here.
3.4.1 Number of secondary structures with fixed number of pairs in
RNA
In this case will denote the number of pairs of nucleotides in stacks (, with if is odd and if is even), hence will be the number of nucleotides in loops (, which is the size of the minimal —hairpin— loop); is the probability distribution for secondary structures with paired nucleotides, for sequences of length (in the limit ). It has been shown [38, 39, 41] that this distribution behaves as a normal distribution in with mean and standard deviation . In the case that structures with stems with less than two base pairs or loops with less than three unpaired bases are forbidden —accounting for minimal energetic constraints— we obtain , , , and Note that different constraints will lead to different values of these quantities, but otherwise will not change the fact that is a normal distribution. Finally, the number of different phenotypes of a sequence of length with paired bases is given, in the limit , by
[TABLE]
3.4.2 Size distribution
In the case that the unpaired sites admit different letters and the paired sites letters (), the size of a phenotype is given by . Here, we will consider that a phenotype is formed by all sequences compatible with that phenotype, thus setting . We have
[TABLE]
Denoting
[TABLE]
and noting that , substitution of (27) into (26) yields the log-normal distribution
[TABLE]
3.4.3 Rank distribution
In the same two-sites approximation
[TABLE]
The functional form of the rank is derived in Appendix B. After some algebra we arrive at
[TABLE]
with constants and depending on parameters of the combinatorial factor , see Appendix B.
4 Discussion
The functional shape of the distribution of phenotype sizes is strongly dependent on the sequence organization within phenotypes. In a first approximation that discards the heterogeneity among genotypes in the same phenotype, one may describe that ensemble of sequences through a prototypic sequence whose sites admit a phenotype-dependent, variable number of letters of the alphabet, a quantity that we have dubbed versatility. The substitution of each sequence in a phenotype by the average over the phenotype seems a strong approximation. However, there is evidence that deviations from the average within a phenotype are small: the number of neutral neighbours of genotypes within a phenotype are tightly clustered around an average value characteristic of that phenotype size [19]. With this proviso, two main elements determine the corresponding distribution of phenotype sizes. The first one, generic for all systems, is the relationship between the size of a phenotype and the versatility of each site . In the framework used in this work, the size of a phenotype can be written in general as
[TABLE]
This product yields an intrinsic allometric relation between the size of a phenotype and the length of the sequence. The second element, specific of each sequence-to-structure map, is the number of phenotypes with similar size. This quantity takes the overall form
[TABLE]
with the combinatorial factor accounting for the number of ways in which an ensemble of sites with values can be arranged into meaningful phenotypes, and the product accounting for the number of neutral sequences within the phenotype. If the values of the combinatorial factor are constrained enough such that the asymptotic behavior of with is subdominant with respect to that of the product —as in Models 1 and 3— the distribution of phenotype sizes is a power-law. If, on the contrary, the dominant term is the combinatorial factor —in particular when the distribution of structural motifs converges to a Gaussian— the distribution of phenotype sizes becomes a lognormal. Our calculations make it explicit that variations in the precise values of versatility, in the number of different classes of sites, or in particular constraints on structures (as, e.g. the minimum number of base pairs required to form a stack) have a quantitative effect on the parameters of the lognormal, but do not affect the shape of the distribution.
In the case we should expect a power-law-like distribution of phenotype sizes characterized by an exponent . The actual value of stems from a combination of the number of genotypes compatible with a given phenotype and the total number of phenotypes with the same (or similar) size. Variations in the functional form of with could be responsible for changes in . In a general scenario, let us assume that phenotype sizes can be ordered according to a certain variable (in our case the number of low versatility positions ), and let us define the total number of genotypes compatible with -phenotypes as , formally generalizing the quantity calculated in the specific models tackled in this work. The behaviour of with determines the value of the exponent : If is constant, then . However, if is exponentially enriched (depleted) in genotypes as grows, the value of becomes larger (smaller) than 2. In the case of Model 3, for example , with and . Two examples of enrichment or depletion in the number of genotypes compatible with -phenotypes are , with and , and , with and . In a very explicit way now, changes in the actual assignment of genotypes to phenotypes through (embedded in ) will affect the probability density distribution.
Another example in the class of Model 3, yielding power-law-like with non-trivial is the model in [33]. Besides the division of sequences into neutral and constrained sites, the authors introduce a stop codon which causes an dependent transition rate to alternative phenotypes, that being the eventual reason for a non-trivial value of . In that case, , which corresponds to a value of and, consistently, , with . The stop codon represents a particular instance of a decreased tolerance to mutations in less versatile sites. Another formal example could be a rate to lethal mutations increasing with . This class of mechanisms skew the assignation of genotypes to phenotypes or, equivalently, deplete the amount of genotypes associated to phenotypes as grows: larger values of imply that there are more positions where non-neutral mutations can occur, and this leads to and . Figure 3 summarizes the sequence organization of different models with a power-law distribution of phenotype sizes, the origin and functional form of the function, and the corresponding value.
In Fig. 4 we represent schematically the functional form of and for the class of our Model 3 and a possibly general class of models analogous to RNA (class 4). At present, it is difficult to clearly match all models in the literature to classes 3 or 4. For example, the hydrophobic-polar (HP) non-compact model seems to be characterized by a distribution of phenotype sizes similar to a power-law [42], while other models for heteropolymers that have been compared to HP yield broad distributions with a maximum [43]. Even RNA with a two-letter alphabet apparently yields power-laws [32], so it might belong to a non-trivial combination of models 3 and 4 as well. This is a very intriguing and complex question that we have to leave for future studies. These considerations notwithstanding, the situation where the combinatorial factor converges to a Gaussian distribution is expected to be very general for sequence-to-structure GP maps [39], implying that a lognormal distribution of phenotype sizes might be a generic property of such maps. Up to now, there are few quantitative results supporting this statement, very likely due to the impossibility to exhaustively fold genome spaces for large . A remarkable exception is [31], where the lognormal distribution has been suggested as the best fit to computational distributions of RNA secondary structure sizes for lengths up to . It is interesting to highlight that our results have been obtained under a uniform assignment of genotypes (represented through our variable ) to phenotypes. However, the many-to-one GP map in RNA assigns the minimum energy structure to each sequence. In the language of our function , the correlation between energy and in RNA will preferentially assign genotypes to phenotypes with a large number of pairs (large ) since, on average, the larger the number of pairs the lower the folding energy [27]. It cannot be discarded that genotype-to-phenotype assignment rules based on quantities not considered here might skew the distribution or eventually yield different functional forms. Though this is a possibility that has to be kept in mind, results in [31] reveal that, at least in the case of four-letters RNA, deviations from lognormality cannot be numerically detected. We suspect that this is likely due to a dominant effect of over both in the many-to-many and in the many-to-one representations of the RNA sequence-to-structure map.
Simple models as those presented here can be used as well to estimate other relevant quantities of GP maps, and to determine if they are almost universal or model-dependent. One such quantity is the relationship between phenotypic robustness and the size of a phenotype. In our scenario, and similarly to other examples [33, 13], phenotypic robustness coincides with genotypic robustness, which is calculated straight forward as the ratio between the number of neutral neighbours, and the total number of neighbours of a sequence, . This yields a function of . Next, is obtained easily from its relationship with , and it takes the general form , where is a model-dependent quantity. Therefore, the relationship between phenotype robustness and the logarithm of phenotype size consistently appears in very generic sequence-to-structure models. The relationship between phenotype robustness and evolvability cannot be derived unless a explicit rule linking possible mutations to phenotypes with different is introduced. In our Models 1, 2, and 3, such a rule, which could take a form analogous to the stop codon of the Fibonacci map [33], is not defined. The case of RNA is particularly interesting and has received significant computational attention since long ago [34]. Only partial explorations of the accessibility of alternative phenotypes have been performed due to the huge sizes of phenotypes [28, 11]. Hopefully, further extensions of our Model 4 could help in the analytical treatment of this highly complex problem. Advances in empirical techniques, such as the intensive use of microarrays, should allow in the near future an exhaustive characterization of actual genotype spaces, as has been done for short transcription factor binding sites [12]. We believe that analyses of empirical GP maps will reveal strengths and weaknesses of the approach here presented, and likely suggest ways of improvement, regarding in particular a formal description of phenotype networks (networks of genotype networks) and evolvability in natural systems.
Appendix A
In order to derive and for Model 2 with constrained sites split into two groups at the extremes of the sequence (Section 3.2.1) it will prove convenient to use the affine transformation of
[TABLE]
Then Eq. (9) can be rewritten
[TABLE]
from which
[TABLE]
Inversion of this equation yields
[TABLE]
with Lambert’s product-logarithm function [44, Def. 4.13.1].
Now,
[TABLE]
and using Eq. (36),
[TABLE]
Finally, since when [44, Prop. 4.13.10], when the rank is large
[TABLE]
with a constant. Then, for large we obtain Eq. (10).
As for , from Eqs. (7) and (8),
[TABLE]
Differentiating with respect to yields . Therefore, eliminating from this same equation we end up with
[TABLE]
with another constant. This is Eq. (11).
Appendix B
The rank function for the case of RNA sequences whose sites may take two values of neutrality and , a number of secondary structures of length with sites with neutrality and a total number of different secondary structures of lenght is
[TABLE]
where
[TABLE]
Now, since will be negative for all , we can use the asymptotic expansion of the complementary error function
[TABLE]
to write
[TABLE]
In order to find how the size of a phenotype depends on its rank value it is convenient to introduce new parameters. Let us denote and , and
[TABLE]
with . The size of a phenotype is given by , therefore
[TABLE]
Now, taking logarithms in (46) and neglecting subdominant terms in ,
[TABLE]
Hence
[TABLE]
and therefore
[TABLE]
which implies
[TABLE]
Author contributions
SM and JAC designed the study, carried out the calculations, interpreted the results and wrote the manuscript. Both authors read and approved the final text.
Acknowledgements
The authors acknowledge the thorough revision of three anonymous reviewers, which has helped improving this paper.
Funding statement
This work has been supported by the Spanish Ministerio de Economía y Competitividad and FEDER funds of the EU through grants ViralESS (FIS2014-57686-P) and VARIANCE (FIS2015-64349-P).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Lipman and Wilbur [1991] David J. Lipman and W. John Wilbur. Modelling neutral and selective evolution of protein folding. Proc. Roy. Soc. London B , 245(1312):7–11, 1991.
- 2Holzgräfe et al. [2011] Ch. Holzgräfe, A. Irbäck, and C. Troein. Mutation-induced fold switching among lattice proteins. J. Chem. Phys. , 135:195101, 2011.
- 3Manrubia and Sanjuán [2013] Susanna C. Manrubia and Rafael Sanjuán. Shape matters: effect of point mutations on RNA secondary structure. Adv. Compl. Syst. , 16:1250052, 2013.
- 4Weirauch and Hughes [2010] Matthew T. Weirauch and Timothy R. Hughes. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet. , 26:66–74, 2010.
- 5Baba et al. [2006] T. Baba, T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K.A. Datsenko, Tomita M., B.L. Wanner, and H. Mori. Construction of Escherichia coli k-12 in-frame, single-gene knockout mutants: the keio collection. Mol. Sys. Biol. , 20:2006.0008, 2006.
- 6Rutherford [2000] S. L. Rutherford. From genotype to phenotype: buffering mechanisms and the storage of genetic information. Bioessays , 22:1095–1105, 2000.
- 7Bloom et al. [2007] J. D. Bloom, P. A. Romero, Z. Lu, and F. H. Arnold. Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution. Biol. Dir. , 2:17, 2007.
- 8Piatigorsky [2007] Joram Piatigorsky. Gene sharing and evolution: the diversity of protein functions . Harvard University Press, Cambridge, MA, 2007.
