Distribution of phenotype sizes in sequence-to-structure genotype-phenotype maps

Susanna Manrubia; Jose A. Cuesta

arXiv:1702.00351·q-bio.PE·April 20, 2017


TL;DR

This paper analytically derives the distribution of phenotype sizes in sequence-to-structure genotype-phenotype maps, revealing how different features influence the size distribution and implications for evolvability.

Contribution

It introduces a hierarchy of models that interpolate between power-law and lognormal distributions, providing insights into the factors shaping phenotype size distributions.

Findings

01. The distribution of phenotype sizes varies between power-law and lognormal.

02. Features of the sequence-to-structure map determine the size distribution.

03. The models help understand evolvability and navigability of genotype space.

Abstract

An essential quantity to ensure evolvability of populations is the navigability of the genotype space. Navigability relies on the existence of sufficiently large genotype networks, that is ensembles of sequences with the same phenotype that guarantee an efficient random drift through sequence space. The number of sequences compatible with a given structure (e.g. the number of RNA sequences folding into a particular secondary structure, or the number of DNA sequences coding for the same protein structure) is astronomically large in all functional molecules investigated. However, an exhaustive experimental or computational study of all RNA folds or all protein structures becomes impossible even for moderately long sequences. Here, we analytically derive the distribution of phenotype sizes for a hierarchy of models which successively incorporate features of increasingly realistic…

Tables

Table 1: Summary of symbols used in this work and their short definitions.

Symbol	Definition
$L$	Sequence or genotype length
$k$	Alphabet size
$v(i)$	Versatility of site $i$
$\ell$	Number of sites in the low-versatility class $v_{2}$
$L-\ell$	Number of sites in the high-versatility class $v_{1}$
$S(\ell)$	Size of a phenotype
$C(\ell)$	Number of phenotypes with the same size
$N_{c}(\ell)$	Set of $\ell$-genotypes
$r(\ell)$	Rank of a phenotype
$p(S)$	Probability density that a phenotype has size $S$
$Q(L,\ell)$	Number of phenotypes from different ordering of sites

Equations

$$C(\ell)=\sum_{i\geqslant\ell}C(i)-\sum_{i\geqslant\ell+1}C(i)=\Pr\{S\leqslant S(\ell)\}-\Pr\{S\leqslant S(\ell+1)\}=\int_{S(\ell+1)}^{S(\ell)}p(S)\,dS.$$

$$S(\ell)$$

$$C(\ell)$$

$$r(\ell)$$

$$\ell=\log_{k}\left[(k-1)r+1\right]\approx\log_{k}\left[(k-1)r\right],$$

$$S(r)\approx\Omega\frac{k^{L}}{k-1}r^{-1}.$$

$$p(S)\propto S^{-2}.$$

$$S(\ell)$$

$$C(\ell)$$

$$r(\ell)$$

$$S(r)\propto\frac{\log r+a}{r},$$

$$p(S)\propto\frac{\log S+b}{S^{2}},$$

$$S(\ell)$$

$$C(\ell)$$

$$\binom{L}{\ell}\sim 2^{L}\sqrt{\frac{2}{\pi L}}\exp\left\{-\frac{2}{L}\left(\ell-\frac{L}{2}\right)^{2}\right\}.$$

$$p(S)\propto\frac{1}{S^{2}}\exp\left\{-\frac{2}{L}\left(\log_{k}S-\frac{L}{2}-\log_{k}\Omega\right)^{2}\right\},$$

$$p(S)\sim\frac{1}{S\log k}\sqrt{\frac{2}{\pi L}}\exp\left\{-\frac{2}{L}\left[\log_{k}S-\frac{L}{2}\left(1-\frac{\log k}{2}\right)-\log_{k}\Omega\right]^{2}\right\}$$

$$S(\ell)$$

$$C(\ell)$$

$$r(\ell)$$

$$\ell=\log_{\kappa}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}\,r\right),$$

$$S(r)=v_{1}^{L}\left(\frac{v_{1}}{v_{2}}\right)^{-\log_{\kappa}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}r\right)}=v_{1}^{L}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}\,r\right)^{-\log_{\kappa}(v_{1}/v_{2})}.$$

$$\alpha=\log_{\kappa}\left(\frac{v_{1}}{v_{2}}\right),$$

$$\ell=-\frac{1}{\alpha}\log_{\kappa}\left(\frac{S}{v_{1}^{L}\Omega}\right),$$

$$p(S)\propto\kappa^{-\frac{1}{\alpha}\log_{\kappa}S}S^{-1}\propto S^{-1-\alpha^{-1}}.$$

$$N_{c}(\ell)=\Omega v_{1}^{L}(k-v_{1}+1)^{L}\left(\frac{v_{2}}{v_{1}}\kappa\right)^{\ell}.$$

$$Q(L,\ell)\sim\frac{1}{\sqrt{2\pi}\,\sigma_{L}}e^{-(\ell-\mu_{L})^{2}/2\sigma_{L}^{2}}Q_{L},$$

$$\ell=\frac{L\log v_{1}-\log S}{2\log\left(\frac{v_{1}}{v_{2}}\right)}.$$

$$\mu_{S}=L\log v_{1}-2\log\left(\frac{v_{1}}{v_{2}}\right)\mu_{L},\qquad\sigma_{S}=2\log\left(\frac{v_{1}}{v_{2}}\right)\sigma_{L},$$

$$p_{L}(S)\sim\frac{1}{\sqrt{2\pi}\,\sigma_{S}S}e^{-(\log S-\mu_{S})^{2}/2\sigma_{S}^{2}}.$$

$$C(\ell)\sim(k-v_{2}+1)^{2\ell}(k-v_{1}+1)^{L-2\ell}Q(L,\ell).$$

$$S(r)\sim v_{1}^{L(1-2a)}v_{2}^{2aL}\exp\left\{\eta L\sqrt{1-\frac{\log r}{cL}}\right\},\qquad\eta\equiv\sigma\sqrt{8c}\,\log\left(\frac{v_{1}}{v_{2}}\right),$$

$$S(\{v(i)\})=\prod_{i}v(i).$$

$$C(\{v(i)\})=Q(L,\{v(i)\})\prod_{i}\left(k-v(i)+1\right),$$


Full text

Distribution of genotype network sizes in sequence-to-structure genotype-phenotype maps

Susanna Manrubia

Grupo Interdisciplinar de Sistemas Complejos (GISC), Madrid

Dept. de Biología de Sistemas, Centro Nacional de Biotecnología (CSIC), Madrid, Spain

José A. Cuesta

Grupo Interdisciplinar de Sistemas Complejos (GISC), Madrid

Dept. de Matemáticas, Universidad Carlos III de Madrid, Leganés, Madrid, Spain

Instituto de Biocomputación y Física de Sistemas Complejos (BIFI)

Universidad de Zaragoza, Spain

UC3M-BS Institute of Financial Big Data (IFiBiD), Universidad Carlos III de Madrid, Getafe, Madrid, Spain

Abstract

An essential quantity to ensure evolvability of populations is the navigability of the genotype space. Navigability, understood as the ease with which alternative phenotypes are reached, relies on the existence of sufficiently large and mutually attainable genotype networks. The size of genotype networks (e.g. the number of RNA sequences folding into a particular secondary structure, or the number of DNA sequences coding for the same protein structure) is astronomically large in all functional molecules investigated: an exhaustive experimental or computational study of all RNA folds or all protein structures becomes impossible even for moderately long sequences. Here, we analytically derive the distribution of genotype network sizes for a hierarchy of models which successively incorporate features of increasingly realistic sequence-to-structure genotype-phenotype maps. The main feature of these models relies on the characterization of each phenotype through a prototypical sequence whose sites admit a variable fraction of letters of the alphabet. Our models interpolate between two limit distributions: a power-law distribution, when the ordering of sites in the prototypical sequence is strongly constrained, and a lognormal distribution, as suggested for RNA, when different orderings of the same set of sites yield different phenotypes. Our main result is the qualitative and quantitative identification of those features of sequence-to-structure maps that lead to different distributions of genotype network sizes.

Keywords: genotype-phenotype map, neutrality, RNA, phenotype size, evolution

1 Introduction

How genotypes map into phenotypes counts amongst the most essential questions to understand how evolutionary innovations might come about and how evolutionarily stable strategies are fixed in populations. With some of its features seemingly dependent on the system studied and on the description level considered, the genotype-phenotype (GP) map appears far from trivial. Many studies have addressed the effect of mutations on phenotype: point mutations [1, 2, 3], genome fragment deletion [4], duplication or inversions, or the knockout of specific genes [5] —among others— may or may not have an effect at the molecular, metabolic, regulatory, or organismal level [6]. Also, the ability of genotypes to yield more than one phenotype is a main resource of molecular adaptation [7, 8]. The probability of expressing different phenotypes or of experiencing mutations that modify the current phenotype depends on the structure of the GP map, which eventually determines how the space of function is explored, and what are the chances that a population survives or innovates in the face of endogenous or exogenous changes [9, 10, 11, 12, 13, 14].

Most models are restricted to the many-to-one realisation of the GP map, and thus assume that adaptation is dominated by mutations. There is a plethora of different model systems studied under this assumption. Despite seemingly relevant underlying molecular differences, those models present a remarkable number of common properties. Exhaustive research on the GP map was pioneered by studies of RNA sequence-to-secondary-structure mappings. Most topological properties identified in RNA spaces are shared by other simple systems, such as the existence of huge genotype networks, the increase in phenotype robustness with the size of the latter, and a very skewed distribution of network sizes. The set of genotypes that yield the same phenotype typically forms a network, since those genotypes are pairwise connected through mutations. Sufficiently large genotype networks so defined were postulated as a condition for the navigability of sequence space long ago [15]. Subsequent studies have shown that such large networks do exist, and that the difference in sequence between genotypes in those networks can be as large as the difference between two random sequences [16, 17, 18, 2]. Phenotype robustness refers to the average effect of point mutations in the genotypes of a specific genotype network. It has been shown to grow logarithmically with the size of the phenotype in RNA [19], in a self-assembly model of protein quaternary structure [20] and in simple models for protein folding [13]. The existence of qualitative and quantitative statistical properties of the GP maps shared by apparently dissimilar systems suggests that they might arise from basic universal features [21, 13]. Though genotype networks are not always fully connected, they do traverse the whole space of genotypes for sufficiently abundant phenotypes, thus ensuring high navigability [22, 23]. 
Even in cases where genotype networks are fragmented, those fragments could be mutually reached if the GP map is many-to-many. The existence of “promiscuous” sequences that map into more than one phenotype enhances navigability and promotes fast adaptation [7, 14].

The statistical property of GP maps that has attracted the most attention is very likely the distribution of genotype network sizes, or phenotype sizes for short. Due to the astronomically large sizes of genotype spaces, initial estimations of the size of phenotypes were performed through random samplings of genotype space. The results were often represented as frequency-rank plots, with phenotypes ordered according to their sizes. Random samplings of genotype spaces in many-to-one GP maps invariably yielded some very abundant phenotypes and a large number of phenotypes represented by a few or just one genotype [24, 25]. Often, a frequency-rank plot was fitted to a generalized Zipf’s law [26], implying a power-law-like distribution of phenotype sizes. However, subsequent studies demonstrated that the frequency-rank plot of phenotype sizes actually had a more complex functional shape [27, 28, 29, 30], and specific functional fits were avoided. Later work has exhaustively mapped complete sequence spaces to their corresponding phenotypes; examples include the RNA sequence-to-minimum-energy secondary structure map [28, 31], the hydrophobic-polar (HP) model for protein folding [29, 2], and toyLIFE, which includes a sequence-to-structure-to-function description [30]. As a result, complete phenotype size distributions (for short sequences) are now available. Fitted shapes range from power-law-like curves [32] to lognormal distributions [31].

It has been argued that, among other generic properties, a skewed distribution of phenotype sizes results from the organization of biological sequences into constrained and unconstrained parts. In [33], the authors introduce the Fibonacci GP map, a many-to-one artificial model, where sites in a sequence can be coding or non-coding, and either lead to new phenotypes under mutations (coding sites) or yield the same phenotype (neutral, non-coding sites). The model can be analytically solved and yields a power-law phenotype size distribution, in qualitative agreement with some observations.

In this contribution, we attempt an identification of the elements in the organization of sequences that characterize the quantitative properties of the distribution of phenotype sizes. We show in a constructive fashion that the model in [33] is an example of a broad spectrum of sequence-to-structure GP models. Starting with the simplest case, where sequences are separated into constrained and neutral parts, and adding subsequent elements in the organization of the sequences and versatility levels of the sites, we show how the distribution of phenotype sizes changes from pure power-law (with an exponent dependent on how genotypes are distributed among phenotypes) to lognormal. This functional form is independent of whether the GP map is many-to-many (sequences are promiscuous) or many-to-one (the phenotype can be uniquely predicted from the sequence). Our final example corresponds to the RNA sequence-to-secondary structure map, where we demonstrate that the combinatorial properties of the distribution of sites of variable neutrality along sequences cause the distribution of phenotype sizes to follow a lognormal distribution, with parameters that can be traced to properties of the genotype set. Our main result is that a lognormal distribution of phenotype sizes is the expected result in any GP map where sufficient variation in the number of phenotypes of similar size is present.

2 Definitions

We will study four models that interpolate between the simplest case of sequences divided into neutral and non-neutral sites separated into two groups and a general case (represented by RNA), and calculate for each of them the size of a phenotype given the sequence organization of its corresponding genotypes, the number of phenotypes with the same size, the frequency rank ordering of phenotypes, and eventually the distribution of phenotype sizes. Table 1 summarizes the nomenclature and definitions used in this work, and Figure 1 illustrates some relevant quantities.

The genotype space is made of sequences of $L$ letters from an alphabet of size $k$. Two examples of alphabet sizes are $k=2$ for a binary alphabet $\{0,1\}$ and $k=4$ for DNA or RNA, $\{$A, C, G, T or U$\}$. The versatility $v(i)$ of site $i$ is defined as the average number of different letters of the alphabet that can occupy sequence position $i$. In general $k\geqslant v(i)\geqslant 1$ for all sites $i$. This is a quantity closely related to neutrality. We will study the simplified case where sites can take one out of two different values, $v(i)\in\{v_{1},v_{2}\}$, with $k\geqslant v_{1}>v_{2}\geqslant 1$. Low-versatility sites are called constrained if $v_{2}=1$, and high-versatility sites are called neutral if $v_{1}=k$. We will use $\ell$ to count the number of sites with low versatility.

The size $S(\ell)$ of a phenotype is the number of different genotypes compatible with that phenotype. From the definition of $\ell$ it follows that $S(\ell)$ is a nonincreasing function of $\ell$ . In the literature, phenotype frequency [33], number of sequences for a phenotype [32] or neutral set size [31] have been used with a meaning identical to phenotype size here. The set of $\ell$ -genotypes is defined as the number of genotypes compatible with $\ell$ -phenotypes, $N_{c}(\ell)\equiv S(\ell)C(\ell)$ . The rank of the first phenotype in size class $C(\ell)$ is $r(\ell)=\sum_{i=0}^{\ell-1}C(i)$ . Note that the total number of phenotypes coincides with the maximum rank.

If $p(S)$ is the probability density that a phenotype has size $S$ , then we can count phenotypes as

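$$C(\ell)=\sum_{i\geqslant\ell}C(i)-\sum_{i\geqslant\ell+1}C(i)=\Pr\{S\leqslant S(\ell)\}-\Pr\{S\leqslant S(\ell+1)\}=\int_{S(\ell+1)}^{S(\ell)}p(S)\,dS.$$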

To first order we can approximate the integral as $C(\ell)\approx p(S)|S^{\prime}(\ell)|$ (the approximation gets better the smaller $|S^{\prime}(\ell)|$). Thus, up to a normalisation constant, $p(S)\propto C\big(\ell(S)\big)\left|S^{\prime}\big(\ell(S)\big)\right|^{-1}$.

The probability density $p(S)$ yields the probability of finding a phenotype with size $S$ when uniformly sampling over phenotypes. This corresponds to the distribution $P_{P}(S)$ , as defined in other studies [31].

Finally, we will also introduce a factor $w(\ell)$ to represent the fraction of $\ell$ -genotypes that actually go to a given $\ell$ -phenotype. This factor arises from additional restrictions in the assignment of genotypes to phenotypes which are not made explicit in the models. In general, if $w(\ell)=1$ the models we are going to introduce assign the same genotypes to several $\ell$ -phenotypes. This would correspond to a many-to-many GP map —a sort of maps suitable to describe molecular promiscuity. Incidentally, molecular promiscuity strongly enhances navigability in genotype space [7, 8, 14]. Other choices may account for specific restrictions in the models; in particular, a suitable choice of $w(\ell)$ may render the GP map many-to-one. We will return to this point when we provide details of the models.

A succinct definition of the hierarchy of models introduced in this work is as follows:

• Model 1: Constrained and neutral sites occupy fixed positions. Sequences are separated in two parts, the first one of length $\ell$ occupied by constrained sites, $v_{2}=1$, and the second part of length $L-\ell$ occupied by neutral sites, $v_{1}=k$. Two minor variants considered are (i) phenotypes are all viable and (ii) lethal mutations occur independently of the site class.

• Model 2: Constrained and neutral sites occupy variable positions. This is illustrated by means of two examples: (i) constrained sites are split into two fragments at the beginning and at the end of the sequence and (ii) constrained sites can occupy arbitrary positions in the sequence.

• Model 3: Versatile sites occupy fixed positions. Two different types of sites with fixed versatilities $v_{1}$ and $v_{2}$ are considered.

• Model 4: Versatile sites occupy variable positions: RNA. In a first approximation, RNA sequences contain two types of sites that occupy different positions in the sequence subject to secondary structure constraints: those forming pairs (stacks) in the secondary structure have average versatility $v_{2}$, and those unpaired (loops) have average versatility $v_{1}$. The model can be generalized to an arbitrary number of site classes.

Figure 2 schematically represents the different models analysed here and some properties that will be of relevance to understand the distributions of phenotype sizes they yield.

3 Results

3.1 Model 1: Constrained and neutral sites occupy fixed positions

This is probably the simplest non-trivial model in the class of GP maps, very similar in spirit to that presented in [33]. Phenotypes are characterized by $\ell$ constrained sites in the first part of the sequence. For a fixed $\ell$ , mutations in a constrained site change the phenotype, and mutations in neutral sites yield genotypes compatible with the phenotype. Therefore,

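$$S(\ell)=w(\ell)\,k^{L-\ell},\tag{1}$$

$$C(\ell)=k^{\ell},\tag{2}$$

$$r(\ell)=\frac{k^{\ell}-1}{k-1}.\tag{3}$$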

Note that, if $w(\ell)=1$ , the complete genotype space is partitioned among $\ell$ -phenotypes for every value of $\ell$ . This implies that, if we consider all possible phenotypes (i.e. all $\ell$ values), a particular genotype is simultaneously compatible with many different phenotypes —representing a highly promiscuous sequence. Specifically, if $w(\ell)=1$ the total number of genotypes compatible with $\ell$ -phenotypes is $N_{c}(\ell)=k^{L}$ , so the total amount of genotypes $\sum_{\ell}N_{c}(\ell)=(L+1)k^{L}$ . This result clearly shows the many-to-many nature of the GP map of this model with this choice of $w(\ell)$ —genotypes are assigned to all phenotypes they are compatible with and, therefore, are repeatedly counted.

A minimal rule to avoid multiple assignments is to think of $w(\ell)$ as the probability that a genotype is actually assigned to an $\ell$-phenotype. When this probability is uniform, $w(\ell)=\Omega$, and we choose $\Omega=(L+1)^{-1}$, the total number of genotypes becomes $\sum_{\ell}N_{c}(\ell)=k^{L}$, the size of the genotype space, so the resulting map is effectively many-to-one. Other examples in which $w(\ell)$ depends on $\ell$ will appear later.
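The two counts quoted above can be checked numerically. The following is a minimal sketch, assuming the Model 1 forms $S(\ell)=w\,k^{L-\ell}$ and $C(\ell)=k^{\ell}$ with a uniform weight $w$; the function name `total_assigned_genotypes` and the values $L=6$, $k=4$ are illustrative choices, not taken from the original work.

```python
from fractions import Fraction

def total_assigned_genotypes(L, k, w):
    """Sum of N_c(l) = S(l) * C(l) over l = 0..L for a uniform weight w.

    Assumed Model 1 forms: S(l) = w * k**(L - l), C(l) = k**l.
    """
    return sum(w * k ** (L - l) * k ** l for l in range(L + 1))

L, k = 6, 4  # illustrative values only

# w = 1: every genotype is counted once per value of l (many-to-many).
many_to_many = total_assigned_genotypes(L, k, Fraction(1))
print(many_to_many == (L + 1) * k ** L)

# w = Omega = 1 / (L + 1): the total equals the size of genotype space.
many_to_one = total_assigned_genotypes(L, k, Fraction(1, L + 1))
print(many_to_one == k ** L)
```

Exact rational arithmetic is used so that the comparison with $k^{L}$ is not affected by rounding.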

Now, to obtain size as a function of rank we must invert $r(\ell)$ to obtain $\ell(r)$ and substitute it into $S(\ell)$ to get $S(r)$. In this case, from Eq. (3) and assuming $(k-1)r\gg 1$,

$$\ell=\log_{k}\left[(k-1)r+1\right]\approx\log_{k}\left[(k-1)r\right],\tag{4}$$

and substituting in (1)

$$S(r)\approx\Omega\frac{k^{L}}{k-1}r^{-1}.\tag{5}$$

To obtain the probability density $p(S)$ we first notice that Eq. (1) implies $k^{\ell}=\Omega k^{L}S^{-1}$ , hence $C(S)=\Omega k^{L}S^{-1}$ . On the other hand $S^{\prime}(\ell)=-(\log k)S$ , thus

$$p(S)\propto S^{-2}.\tag{6}$$

Hence the probability distribution is a power-law with exponent $\beta=2$ .
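The rank-size relation and the exponent can be illustrated with a short numerical sketch. It assumes the Model 1 forms $S(\ell)=\Omega k^{L-\ell}$, $C(\ell)=k^{\ell}$ and $r(\ell)=\sum_{i<\ell}C(i)$; the function names and the parameter values are hypothetical and chosen only for illustration.

```python
from math import log

def model1_rank_size(L, k, omega=1.0):
    """Rank and size of the first phenotype in each size class l = 1..L.

    Assumes S(l) = omega * k**(L - l) and C(l) = k**l.
    """
    rows = []
    for l in range(1, L + 1):
        rank = (k ** l - 1) // (k - 1)   # r(l) = sum of C(i) for i < l
        size = omega * k ** (L - l)
        rows.append((rank, size))
    return rows

def model1_density_exponent(L, k, l, omega=1.0):
    """Log-log slope of p(S) ~ C(l) / |S'(l)| between classes l and l + 1."""
    def size(x):
        return omega * k ** (L - x)

    def density(x):
        return k ** x / (log(k) * size(x))

    return log(density(l + 1) / density(l)) / log(size(l + 1) / size(l))

L, k = 12, 4  # illustrative values only
for rank, size in model1_rank_size(L, k):
    # rank * size tends to omega * k**L / (k - 1), i.e. S(r) ~ 1/r,
    # so the printed ratio approaches 1 as l grows.
    print(rank, size, rank * size * (k - 1) / k ** L)

print(model1_density_exponent(L, k, l=5))  # close to -2
```

The ratio printed in the loop equals $1-k^{-\ell}$, which makes explicit how quickly the approximation $(k-1)r\gg 1$ becomes accurate.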

3.1.1 Non-viable genotypes arise from uniformly distributed lethal mutations

In the same scenario as above, let us assume that a fraction $\delta$ of mutations is lethal, thus leading to a non-viable genotype. In this case, Eqs. (1) to (3) are identical, with $k$ substituted by $k(1-\delta)$ . Therefore, $S(r)$ and $p(S)$ are as above with the latter change. This result shows that the existence of a non-viable class to which viable genotypes can mutate does not necessarily imply relevant functional changes in the distribution of phenotypes, which is in either case of the form $p(S)\sim S^{-\beta}$ , with $\beta=2$ . The effect of uniformly distributed lethal mutations could be therefore absorbed as a constant into $\Omega$ . The situation changes if mutations are not distributed uniformly, but their likelihood depends on $\ell$ . This would be a particular realisation of Model 3 introduced below.

3.2 Model 2: Constrained and neutral sites occupy variable positions

In any realistic model (e.g. the case of RNA) the position of constrained and neutral sites should matter in the definition of a phenotype. While $S(\ell)$ does not change its functional form as a result, $C(\ell)$ does (and $r(\ell)$ as a consequence), causing potentially relevant modifications in $S(r)$ and $p(S)$. In general, the number of different phenotypes would take the form $C(\ell)=k^{\ell}Q(L,\ell)$, where $k^{\ell}$ accounts for changes in the letter of the constrained site (yielding a different phenotype, as assumed) and $Q(L,\ell)$ is a model-dependent combinatorial number that counts the different ways in which the $\ell$ sites can be arranged to yield meaningful (and different) phenotypes. In general, the factor $S^{-2}$ in $p(S)$ stems from mutations in neutral sites, while the arrangement of constrained and neutral sites along the sequence is weighted by $Q\big(L,\ell(S)\big)$, with effects on the functional form of $p(S)$ that, in general, depend on the permitted arrangements. As will be shown, $Q(L,\ell)$ might enormously increase the number of phenotypes and, especially, the relative abundances of $\ell$-phenotypes.

3.2.1 Constrained sites are split into two groups at the extremes of the sequence

By way of example, let us consider one of the simplest situations where the position of the constrained sites matters. Suppose that those sites can be split into two groups with lengths $\ell_{1}$ and $\ell_{2}$, placed at the beginning and at the end of the sequence (such that $0\leqslant\ell_{1},\ell_{2}\leqslant L$ and $\ell_{1}+\ell_{2}=\ell$). This gives $Q(L,\ell)=\ell+1$ different phenotypes with $\ell$ constrained sites, and

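$$S(\ell)=w(\ell)\,k^{L-\ell},\tag{7}$$

$$C(\ell)=(\ell+1)\,k^{\ell},\tag{8}$$

$$r(\ell)=\sum_{i=0}^{\ell-1}(i+1)\,k^{i}=\frac{\ell k^{\ell+1}-(\ell+1)k^{\ell}+1}{(k-1)^{2}}.\tag{9}$$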

From these expressions we can obtain (see Appendix A) the asymptotic (for large $r$ ) rank distribution

$$S(r)\propto\frac{\log r+a}{r},\tag{10}$$

and the size probability density

$$p(S)\propto\frac{\log S+b}{S^{2}},\tag{11}$$

with $a$ and $b$ some constants.

Therefore, even in this simple case with quite a limited number of possible organizations of constrained sites, $S(r)$ and $p(S)$ are no longer pure power-laws, though the dominant term of the phenotype size distribution (size still dominated by mutations in neutral sites) is characterized by an exponent $\beta=2$. The total number of genotypes compatible with $\ell$-phenotypes is also modified, $N_{c}(\ell)=k^{L}(\ell+1)$, and is seen to increase linearly with $\ell$.
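The logarithmic correction to the rank-size law can be seen in a small numerical sketch. It assumes $S(\ell)=\Omega k^{L-\ell}$ as in Model 1 together with $C(\ell)=k^{\ell}Q(L,\ell)$ and $Q(L,\ell)=\ell+1$ from the text; the function names and the values $L=20$, $k=4$ are illustrative assumptions.

```python
def model2_split_rank_size(L, k, omega=1.0):
    """Rows (l, rank, size) for the first phenotype of each class l = 1..L.

    Assumes S(l) = omega * k**(L - l) and C(l) = (l + 1) * k**l.
    """
    rows = []
    rank = 0
    for l in range(L + 1):
        if l > 0:
            rows.append((l, rank, omega * k ** (L - l)))
        rank += (l + 1) * k ** l   # add the phenotypes of class l
    return rows

def scaled_products(L, k):
    """Values of rank * size / k**L, one per class, with omega = 1."""
    return [rank * size / k ** L for _, rank, size in model2_split_rank_size(L, k)]

L, k = 20, 4  # illustrative values only
values = scaled_products(L, k)
steps = [b - a for a, b in zip(values, values[1:])]
# For a pure power law S(r) ~ 1/r the product would saturate. Here it keeps
# growing by about 1 / (k - 1) per class, i.e. linearly in l ~ log_k(r).
print(steps[-1], 1 / (k - 1))
```

Since $\ell$ grows like $\log_{k}r$, a product $rS$ that is linear in $\ell$ corresponds to the form $S(r)\propto(\log r+a)/r$ quoted above.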

3.2.2 Constrained sites can occupy any position in the sequence

We now assume that the constrained and unconstrained sites can occupy any site of the chain. In that case

$$S(\ell)=\Omega\,k^{L-\ell},\tag{12}$$

$$C(\ell)=k^{\ell}\binom{L}{\ell},\tag{13}$$

with no simple expression for $r(\ell)$ . Let us focus, however, on the size distribution $p(S)$ , and consider the case where $L\gg 1$ . Asymptotically for $L\to\infty$

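$$\binom{L}{\ell}\sim 2^{L}\sqrt{\frac{2}{\pi L}}\exp\left\{-\frac{2}{L}\left(\ell-\frac{L}{2}\right)^{2}\right\}.\tag{14}$$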

Changing $\ell$ to $S$ through $\ell=L+\log_{k}\Omega-\log_{k}S$

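$$p(S)\propto\frac{1}{S^{2}}\exp\left\{-\frac{2}{L}\left(\log_{k}S-\frac{L}{2}-\log_{k}\Omega\right)^{2}\right\},\tag{15}$$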

and writing $S^{-1}=\exp(-\log k\,\log_{k}S)$ , we finally obtain

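$$p(S)\sim\frac{1}{S\log k}\sqrt{\frac{2}{\pi L}}\exp\left\{-\frac{2}{L}\left[\log_{k}S-\frac{L}{2}\left(1-\frac{\log k}{2}\right)-\log_{k}\Omega\right]^{2}\right\},\tag{16}$$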

a log-normal distribution with mean $\mu_{L}\sim\frac{\log k}{2}\left(1-\frac{\log k}{2}\right)L+\log\Omega$ and variance $\sigma_{L}^{2}\sim\left(\frac{\log k}{2}\right)^{2}L$ , very different from the $p(S)\sim S^{-2}$ distribution of the previous cases.

This section presents an example of a main result of this study. It shows that, when the definition of the phenotype depends on the specific position of constrained and neutral sites in sequences, the functional form of $p(S)$ (and, in consequence, of $S(r)$ ) qualitatively changes. In particular, the exponential growth of $Q(L,\ell)$ with $L$ dominates $p(S)$ , which takes the form of a lognormal distribution. Other quantities defining the GP map, such as $k$ or $\Omega$ , change now the parameters of the distribution, but do not modify its shape.
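The mechanism can be illustrated with an exact finite-$L$ computation. The sketch below assumes $C(\ell)=k^{\ell}\binom{L}{\ell}$ and uniform sampling over phenotypes. Under these assumptions the normalised counts form a binomial distribution in $\ell$ with success probability $k/(k+1)$ (a standard identity, not stated in the text), and since $\log_{k}S=L+\log_{k}\Omega-\ell$ is linear in $\ell$, $\log S$ is approximately normal for large $L$. The function name and the values $L=100$, $k=4$ are illustrative.

```python
from math import comb

def model2_free_positions(L, k):
    """Total count, mean and variance of l under uniform phenotype sampling.

    Assumes C(l) = k**l * comb(L, l) phenotypes with l constrained sites.
    """
    counts = [k ** l * comb(L, l) for l in range(L + 1)]
    total = sum(counts)  # equals (k + 1)**L by the binomial theorem
    mean = sum(l * c for l, c in enumerate(counts)) / total
    var = sum((l - mean) ** 2 * c for l, c in enumerate(counts)) / total
    return total, mean, var

L, k = 100, 4  # illustrative values only
total, mean, var = model2_free_positions(L, k)
print(total == (k + 1) ** L)
print(mean, L * k / (k + 1))        # exact binomial mean of l
print(var, L * k / (k + 1) ** 2)    # exact binomial variance of l
```

Both the mean and the variance of $\ell$, and hence of $\log S$, grow linearly with $L$, which is the signature of the lognormal shape discussed above.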

3.3 Model 3: Versatile sites occupy fixed positions

The models analysed above demonstrate that when sites are either constrained or neutral, the exponent associated to the power-law part of $p(S)$ is $\beta=2$ . As we show next, this exponent is modified when the sites in the sequence show intermediate degrees of versatility, which causes the number of $\ell$ -genotypes to depend on $\ell$ .

Let us consider the case where the $L-\ell$ sites are just less constrained than the $\ell$ sites, such that the former admit an average of $v_{1}$ different letters of the alphabet and the latter admit $v_{2}$ , with $k\geqslant v_{1}>v_{2}\geqslant 1$ . Relevant functions read

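$$S(\ell)=\Omega\,v_{1}^{L-\ell}v_{2}^{\ell},\tag{17}$$

$$C(\ell)=(k-v_{2}+1)^{\ell}(k-v_{1}+1)^{L-\ell},\tag{18}$$

$$r(\ell)=(k-v_{1}+1)^{L}\,\frac{\kappa^{\ell}-1}{\kappa-1},\tag{19}$$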

with $\kappa\equiv(k-v_{2}+1)/(k-v_{1}+1)$ .

As can readily be seen by substitution, these expressions reduce to Model 1 for $v_{1}=k$ and $v_{2}=1$. Now,

$$\ell=\log_{\kappa}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}\,r\right),\tag{20}$$

yielding

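$$S(r)=v_{1}^{L}\left(\frac{v_{1}}{v_{2}}\right)^{-\log_{\kappa}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}r\right)}=v_{1}^{L}\left(1+\frac{\kappa-1}{(k-v_{1}+1)^{L}}\,r\right)^{-\log_{\kappa}(v_{1}/v_{2})}.\tag{21}$$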

For large $r$ this scales as $S(r)\sim cr^{-\alpha}$ , where $\alpha$ depends on $v_{1}$ and $v_{2}$ as

$$\alpha=\log_{\kappa}\left(\frac{v_{1}}{v_{2}}\right),\tag{22}$$

yielding $\alpha=1$ in the limit of Model 1. Substituting this expression into Eq. (17),

$$\ell=-\frac{1}{\alpha}\log_{\kappa}\left(\frac{S}{v_{1}^{L}\Omega}\right),\tag{23}$$

hence, up to a constant factor,

$$p(S)\propto\kappa^{-\frac{1}{\alpha}\log_{\kappa}S}S^{-1}\propto S^{-1-\alpha^{-1}}.\tag{24}$$

Again $p(S)$ maintains its power-law shape but its exponent depends on $v_{1}$ and $v_{2}$ .

The number of $\ell$ -genotypes now becomes

$$N_{c}(\ell)=\Omega v_{1}^{L}(k-v_{1}+1)^{L}\left(\frac{v_{2}}{v_{1}}\kappa\right)^{\ell}.\tag{25}$$

This number can either increase or decrease with $\ell$ depending on whether $(v_{2}/v_{1})\kappa$ is larger or smaller than 1. Both situations are possible under the constraint $v_{1}>v_{2}$. The values of $\alpha$ and $\beta$ change in response to possible enrichments or depletions in the total number of assigned genotypes with $\ell$. This is a first example of similar cases encountered later in this work and in the literature, as we discuss later.
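A short sketch makes the dependence on $v_{1}$ and $v_{2}$ concrete. It uses the definitions $\kappa=(k-v_{2}+1)/(k-v_{1}+1)$ and $\alpha=\log_{\kappa}(v_{1}/v_{2})$ from the text, takes $\beta=1+\alpha^{-1}$ as the exponent of $p(S)$, and reports the factor $(v_{2}/v_{1})\kappa$ by which the number of $\ell$-genotypes changes per unit of $\ell$. The function name and the example values are illustrative assumptions.

```python
from math import log

def model3_exponents(k, v1, v2):
    """Return (kappa, alpha, beta, growth) for Model 3.

    growth = (v2 / v1) * kappa is the factor by which N_c(l) changes
    when l increases by one.
    """
    if not (k >= v1 > v2 >= 1):
        raise ValueError("need k >= v1 > v2 >= 1")
    kappa = (k - v2 + 1) / (k - v1 + 1)
    alpha = log(v1 / v2) / log(kappa)   # alpha = log_kappa(v1 / v2)
    beta = 1 + 1 / alpha                # p(S) ~ S**(-beta)
    growth = (v2 / v1) * kappa
    return kappa, alpha, beta, growth

# Model 1 limit: alpha = 1, beta = 2, and N_c(l) does not depend on l.
print(model3_exponents(4, 4, 1))
# Illustrative case where the number of l-genotypes decreases with l.
print(model3_exponents(4, 3, 1))
# Illustrative case where the number of l-genotypes increases with l.
print(model3_exponents(4, 4, 2))
```

With $k=4$, the three calls give growth factors equal to, below and above 1 respectively, so both regimes mentioned above occur for admissible versatilities.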

3.4 Model 4: Versatile sites occupy variable positions: RNA

In a first approximation (which has been shown to yield acceptable fits to data [19]), RNA sequences can be divided into two classes of sites: those in stacks (bound) and those in loops (unbound), characterized by different degrees of neutrality (see e.g. [34] and Fig. 4 in [35]). A change in the position of loops and stacks yields a different phenotype. Additionally, the composition of each site in the sequence bears a significant correlation with the structural element it will preferentially represent in the phenotype (see Fig. 7 in [36]). Therefore, a first approximation to a GP representation of RNA involves elements in our previous Models 2 and 3. In the following, the abundances of phenotypes will be ruled by (averaged) values $v_{2}$ and $v_{1}$ of the number of letters that can be changed in stacks or loops, respectively (see Fig. 2), without affecting the phenotype.

Studies of RNA neutral networks and their related properties are usually restricted to the many-to-one mapping between sequence and structure. Despite the fact that any RNA sequence is compatible with multiple structures whose relative weight in an ensemble of identical sequences is defined by their folding energy [37], it is common practice to select only the minimum energy fold as the associated phenotype. This decision transforms an intrinsic many-to-many GP map where alternative phenotypes can be reached through mutations or promiscuity, into a many-to-one map where navigability is limited to the effects of neutral drift. Analytical approaches cannot include, in general, energetic considerations, so they implicitly work in the many-to-many unrestricted case. This situation is comparable to the assignment of sequences to structures we have performed in our models, where every sequence is assigned to all phenotypes it is compatible with, while possible restrictions in the assignments are encompassed in $w(\ell)$. The distribution of secondary structure sizes for the unrestricted map (i.e. all sequences compatible with a given secondary structure) fixing the number of stacks or loops has been derived in [38] for the general case of structures with pseudoknots, in [39] and [40], and in [41] in a form that will be used here.

3.4.1 Number of secondary structures with fixed number of pairs in RNA

In this case $\ell$ will denote the number of pairs of nucleotides in stacks ($\ell=1,2,\dots,(L-j)/2$, with $j=3$ if $L$ is odd and $j=4$ if $L$ is even), hence $L-2\ell$ will be the number of nucleotides in loops ($L-2\ell\geqslant 3$, which is the size of the minimal hairpin loop); $p_{L,\ell}$ is the probability distribution for secondary structures with $2\ell$ paired nucleotides, for sequences of length $L$ (in the limit $L,\ell\to\infty$). It has been shown [38, 39, 41] that this distribution behaves as a normal distribution in $\ell$ with mean $\mu_{L}=\mu L+\mu_{0}+O\left(L^{-1}\right)$ and standard deviation $\sigma_{L}=\sigma L^{1/2}+\sigma_{0}L^{-1/2}+O\left(L^{-3/2}\right)$. In the case that structures with stems with fewer than two base pairs or loops with fewer than three unpaired bases are forbidden (accounting for minimal energetic constraints) we obtain $\mu\approx 0.28647\dots$, $\mu_{0}\approx-1.36502\dots$, $\sigma\approx 0.25510\dots$, and $\sigma_{0}\approx-0.00713\dots$. Note that different constraints will lead to different values of these quantities, but otherwise will not change the fact that $p_{L,\ell}$ is a normal distribution. Finally, the number $Q(L,\ell)$ of different phenotypes of a sequence of length $L$ with $2\ell$ paired bases is given, in the limit $L,\ell\to\infty$, by

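$$Q(L,\ell)=\frac{Q_{L}}{\sqrt{2\pi}\,\sigma_{L}}\exp\left\{-\frac{(\ell-\mu_{L})^{2}}{2\sigma_{L}^{2}}\right\},$$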

with $Q_{L}\sim 1.48L^{-3/2}(1.85)^{L}$ (see [38, 39, 40, 41]).
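As a rough numerical illustration of the asymptotic expressions in this subsection, the sketch below evaluates $\mu_{L}$, $\sigma_{L}$ and $Q_{L}$ for a given length. It also approximates $Q(L,\ell)$ under the assumption that it equals $Q_{L}$ times a normal density in $\ell$ with that mean and standard deviation. The function names are hypothetical and the $O(\cdot)$ corrections are dropped, so values for short sequences are only indicative.

```python
import math

# Constants quoted in the text (stems >= 2 base pairs, loops >= 3 unpaired bases).
MU, MU0 = 0.28647, -1.36502
SIGMA, SIGMA0 = 0.25510, -0.00713


def mean_pairs(L):
    """Leading terms of the mean number of pairs, mu_L = mu*L + mu_0."""
    return MU * L + MU0


def std_pairs(L):
    """Leading terms of sigma_L = sigma*L^(1/2) + sigma_0*L^(-1/2)."""
    return SIGMA * math.sqrt(L) + SIGMA0 / math.sqrt(L)


def log_total_structures(L):
    """Natural log of Q_L ~ 1.48 * L^(-3/2) * 1.85^L (logs avoid overflow)."""
    return math.log(1.48) - 1.5 * math.log(L) + L * math.log(1.85)


def log_structures_with_pairs(L, ell):
    """log Q(L, ell), ASSUMING Q(L, ell) = Q_L * Normal(ell; mu_L, sigma_L)."""
    m, s = mean_pairs(L), std_pairs(L)
    log_density = -0.5 * ((ell - m) / s) ** 2 - math.log(s * math.sqrt(2 * math.pi))
    return log_total_structures(L) + log_density


if __name__ == "__main__":
    L = 100
    print(f"mu_L = {mean_pairs(L):.3f}, sigma_L = {std_pairs(L):.3f}")
    print(f"log10 Q_L = {log_total_structures(L) / math.log(10):.2f}")
```

Working with logarithms keeps the computation stable for long sequences, where $Q_{L}$ itself would overflow a floating-point number.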

3.4.2 Size distribution

In the case that the unpaired sites admit $v_{1}$ different letters and the paired sites $v_{2}$ letters ( $1\leqslant v_{2}<v_{1}\leqslant k$ ), the size of a phenotype is given by $S(\ell)=v_{1}^{L-2\ell}v_{2}^{2\ell}$ . Here, we will consider that a phenotype is formed by all sequences compatible with that phenotype, thus setting $\Omega=1$ . We have

[TABLE]

Denoting

[TABLE]

and noting that $p_{L}(S)=Q(L,\ell)/Q_{L}$ , substitution of (27) into (26) yields the log-normal distribution

[TABLE]
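The step from a normal distribution in $\ell$ to a lognormal distribution in $S$ rests on $\log S$ being an affine function of $\ell$. The following sketch makes this explicit, assuming $S(\ell)=v_{1}^{L-2\ell}v_{2}^{2\ell}$ and $\ell$ normally distributed with mean $\mu_{L}$ and standard deviation $\sigma_{L}$. The function names and the numerical values in the usage note are illustrative choices, not taken from the paper.

```python
import math


def log_size(L, ell, v1, v2):
    """ln S(ell) for S = v1^(L - 2*ell) * v2^(2*ell)."""
    return (L - 2 * ell) * math.log(v1) + 2 * ell * math.log(v2)


def lognormal_params(L, mu_L, sigma_L, v1, v2):
    """Mean and standard deviation of ln S when ell ~ Normal(mu_L, sigma_L).

    ln S = L*ln(v1) - 2*ell*ln(v1/v2) is affine in ell, so a normal ell
    maps to a normal ln S, that is, a lognormal S.
    """
    slope = 2 * math.log(v1 / v2)  # |d ln S / d ell|
    mean_log_s = L * math.log(v1) - slope * mu_L
    std_log_s = slope * sigma_L
    return mean_log_s, std_log_s


if __name__ == "__main__":
    # Illustrative values only: L = 100, v1 = 4, v2 = 2.
    mean_log_s, std_log_s = lognormal_params(100, 27.28, 2.55, 4, 2)
    print(f"mean of ln S = {mean_log_s:.2f}, std of ln S = {std_log_s:.2f}")
```

Changing $v_{1}$ or $v_{2}$ only rescales the slope, which shifts and stretches the lognormal without altering its functional form.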

3.4.3 Rank distribution

In the same two-sites approximation

[TABLE]

The functional form of the rank $r(\ell)$ is derived in Appendix B. After some algebra we arrive at

[TABLE]

with constants $a$ and $c$ depending on parameters of the combinatorial factor $Q(L,\ell)$ , see Appendix B.

4 Discussion

The functional shape of the distribution of phenotype sizes is strongly dependent on the sequence organization within phenotypes. In a first approximation that discards the heterogeneity among genotypes in the same phenotype, one may describe that ensemble of sequences through a prototypic sequence whose sites admit a phenotype-dependent, variable number of letters of the alphabet, a quantity that we have dubbed versatility. The substitution of each sequence in a phenotype by the average over the phenotype seems a strong approximation. However, there is evidence that deviations from the average within a phenotype are small: the number of neutral neighbours of genotypes within a phenotype is tightly clustered around an average value characteristic of that phenotype size [19]. With this proviso, two main elements determine the corresponding distribution of phenotype sizes. The first one, generic for all systems, is the relationship between the size of a phenotype and the versatility $v(i)$ of each site $i$. In the framework used in this work, the size of a phenotype can be written in general as

$$S=\prod_{i=1}^{L}v(i).$$

This product yields an intrinsic allometric relation between the size of a phenotype and the length of the sequence. The second element, specific to each sequence-to-structure map, is the number of phenotypes with similar size. This quantity takes the overall form

[TABLE]

with the combinatorial factor accounting for the number of ways in which an ensemble of $L$ sites with $v(i)$ values can be arranged into meaningful phenotypes, and the product accounting for the number of neutral sequences within the phenotype. If the values of the combinatorial factor are constrained enough that the asymptotic behavior of $Q(L,\{v(i)\})$ with $L$ is subdominant with respect to that of the product (as in Models 1 and 3), the distribution of phenotype sizes is a power-law. If, on the contrary, the dominant term is the combinatorial factor (in particular when the distribution of structural motifs converges to a Gaussian), the distribution of phenotype sizes becomes a lognormal. Our calculations make it explicit that variations in the precise values of versatility, in the number of different classes of sites, or in particular constraints on structures (e.g. the minimum number of base pairs required to form a stack) have a quantitative effect on the parameters of the lognormal, but do not affect the shape of the distribution.

In the case $Q(L,\{v(i)\})\simeq 1$ we should expect a power-law-like distribution of phenotype sizes characterized by an exponent $\beta$. The actual value of $\beta$ stems from a combination of the number of genotypes compatible with a given phenotype and the total number of phenotypes with the same (or similar) size. Variations in the functional form of $w(\ell)$ with $\ell$ could be responsible for changes in $\beta$. In a general scenario, let us assume that phenotype sizes can be ordered according to a certain variable $\lambda$ (in our case the number of low versatility positions $\ell$), and let us define the total number of genotypes compatible with $\lambda$-phenotypes as $N_{c}(\lambda)\equiv S(\lambda)C(\lambda)$, formally generalizing the quantity calculated in the specific models tackled in this work. The behaviour of $N_{c}(\lambda)$ with $\lambda$ determines the value of the exponent $\beta$: if $N_{c}(\lambda)$ is constant, then $\beta=2$. However, if $N_{c}(\lambda)$ is exponentially enriched (depleted) in genotypes as $\lambda$ grows, the value of $\beta$ becomes larger (smaller) than 2. In the case of Model 3, for example, $N_{c}(\ell)=AB^{\ell}$, with $B=(v_{2}/v_{1})(k-v_{2}+1)/(k-v_{1}+1)$ and $\beta=1+1/\alpha$. Two examples of enrichment or depletion in the number of genotypes compatible with $\ell$-phenotypes are $\{v_{1},v_{2}\}=\{4,2.5\}$, with $B=1.56$ and $\beta=2.95$, and $\{v_{1},v_{2}\}=\{3,1.5\}$, with $B=0.875$ and $\beta=1.81$. In a very explicit way now, changes in the actual assignment of genotypes to phenotypes through $w(\lambda)$ (embedded in $S(\lambda)$) will affect the probability density distribution.
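The two numerical examples quoted for Model 3 can be reproduced with a few lines of code. The sketch below uses the expression for $B$ given in the text and assumes an alphabet size $k=4$, which is not stated alongside the examples but matches the quoted values. The closed form $\beta=2+\ln B/\ln(v_{1}/v_{2})$ is our own rearrangement, obtained from $p(S)\propto N_{c}(\ell)/S^{2}$ with $N_{c}=AB^{\ell}$ and $S\propto(v_{2}/v_{1})^{\ell}$. It should be read as a consistency check rather than as the paper's definition of $\beta$.

```python
import math


def enrichment_factor(v1, v2, k):
    """B = (v2/v1) * (k - v2 + 1) / (k - v1 + 1), as quoted for Model 3."""
    return (v2 / v1) * (k - v2 + 1) / (k - v1 + 1)


def exponent_beta(B, v1, v2):
    """beta = 2 + ln(B) / ln(v1/v2).

    ASSUMED rearrangement: p(S) ~ N_c(ell) / S^2 with N_c = A*B^ell and
    S ~ (v2/v1)^ell, so B > 1 gives beta > 2 and B < 1 gives beta < 2.
    """
    return 2 + math.log(B) / math.log(v1 / v2)


if __name__ == "__main__":
    k = 4  # assumed alphabet size
    for v1, v2 in [(4, 2.5), (3, 1.5)]:
        B = enrichment_factor(v1, v2, k)
        print(f"v1={v1}, v2={v2}: B={B:.3f}, beta={exponent_beta(B, v1, v2):.2f}")
```

Setting $B=1$ recovers $\beta=2$, the case of constant $N_{c}(\lambda)$ discussed above.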

Another example in the class of Model 3, yielding power-law-like $p(S)$ with non-trivial $\beta$, is the model in [33]. Besides the division of sequences into neutral and constrained sites, the authors introduce a stop codon which causes an $\ell$-dependent transition rate to alternative phenotypes, that being the eventual reason for a non-trivial value of $\beta$. In that case, $N_{c}(\ell)\approx 2^{L-\ell}\phi^{\ell-1}/\sqrt{5}$, which corresponds to a value of $B=0.81$ and, consistently, $2>\beta=1.69$, with $w(\ell)=\phi^{\ell-1}/(2^{\ell}\sqrt{5})$. The stop codon represents a particular instance of a decreased tolerance to mutations in less versatile sites. Another formal example could be a rate of lethal mutations increasing with $\ell$. This class of mechanisms skews the assignment of genotypes to phenotypes or, equivalently, depletes the number of genotypes associated with phenotypes as $\ell$ grows: larger values of $\ell$ imply that there are more positions where non-neutral mutations can occur, and this leads to $\alpha>1$ and $\beta<2$. Figure 3 summarizes the sequence organization of different models with a power-law distribution of phenotype sizes, the origin and functional form of the $N_{c}(\ell)$ function, and the corresponding $\beta$ value.
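The values $B=0.81$ and $\beta=1.69$ quoted for the model in [33] can be checked directly from the expression for $N_{c}(\ell)$. The sketch below computes $B$ as the ratio $N_{c}(\ell+1)/N_{c}(\ell)$, which equals $\phi/2$ with $\phi$ the golden ratio. It then evaluates $\beta$ under the assumption that phenotype size halves with each additional constrained site, $S(\ell)\propto 2^{-\ell}$, so that $\beta=1+\log_{2}\phi$. The function names and that size law are assumptions made for illustration.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # golden ratio


def n_compatible(L, ell):
    """N_c(ell) ~ 2^(L - ell) * phi^(ell - 1) / sqrt(5), as quoted in the text."""
    return 2.0 ** (L - ell) * PHI ** (ell - 1) / math.sqrt(5)


def growth_factor(L, ell):
    """B = N_c(ell + 1) / N_c(ell); independent of L and ell, equal to phi/2."""
    return n_compatible(L, ell + 1) / n_compatible(L, ell)


def beta_from_growth(B, size_ratio=2.0):
    """beta = 2 + ln(B) / ln(size_ratio), ASSUMING S(ell) ~ size_ratio^(-ell)."""
    return 2 + math.log(B) / math.log(size_ratio)


if __name__ == "__main__":
    B = growth_factor(L=30, ell=10)
    print(f"B = {B:.3f}, beta = {beta_from_growth(B):.3f}")
```

Because $B<1$, the number of compatible genotypes is depleted as $\ell$ grows, which is what pushes $\beta$ below 2.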

In Fig. 4 we represent schematically the functional form of $S(r)$ and $p(S)$ for the class of our Model 3 and a possibly general class of models analogous to RNA (class 4). At present, it is difficult to clearly match all models in the literature to classes 3 or 4. For example, the hydrophobic-polar (HP) non-compact model seems to be characterized by a distribution of phenotype sizes similar to a power-law [42], while other models for heteropolymers that have been compared to HP yield broad distributions with a maximum [43]. Even RNA with a two-letter alphabet apparently yields power-laws [32], so it might belong to a non-trivial combination of models 3 and 4 as well. This is a very intriguing and complex question that we have to leave for future studies. These considerations notwithstanding, the situation where the combinatorial factor converges to a Gaussian distribution is expected to be very general for sequence-to-structure GP maps [39], implying that a lognormal distribution of phenotype sizes might be a generic property of such maps. Up to now, there are few quantitative results supporting this statement, very likely due to the impossibility of exhaustively folding genome spaces for large $L$. A remarkable exception is [31], where the lognormal distribution has been suggested as the best fit to computational distributions of RNA secondary structure sizes for lengths up to $L=126$. It is interesting to highlight that our results have been obtained under a uniform assignment of genotypes (represented through our variable $\Omega$) to phenotypes. However, the many-to-one GP map in RNA assigns the minimum energy structure to each sequence. In the language of our function $w(\ell)$, the correlation between energy and $\ell$ in RNA will preferentially assign genotypes to phenotypes with a large number of pairs (large $\ell$) since, on average, the larger the number of pairs the lower the folding energy [27].
It cannot be ruled out that genotype-to-phenotype assignment rules based on quantities not considered here might skew the distribution or eventually yield different functional forms. Though this is a possibility that has to be kept in mind, results in [31] reveal that, at least in the case of four-letter RNA, deviations from lognormality cannot be numerically detected. We suspect that this is likely due to a dominant effect of $Q(L,\ell)$ over $w(\ell)$ both in the many-to-many and in the many-to-one representations of the RNA sequence-to-structure map.

Simple models such as those presented here can also be used to estimate other relevant quantities of GP maps, and to determine if they are almost universal or model-dependent. One such quantity is the relationship between phenotypic robustness and the size of a phenotype. In our scenario, and similarly to other examples [33, 13], phenotypic robustness coincides with genotypic robustness, which is calculated straightforwardly as the ratio between the number of neutral neighbours, $(v_{1}-1)(L-\ell)+(v_{2}-1)\ell$, and the total number of neighbours of a sequence, $L(k-1)$. This yields a function of $\ell/L$. Next, $\ell$ is obtained easily from its relationship with $S(\ell)$, and it takes the general form $\ell\propto\log_{\zeta}S$, where $\zeta$ is a model-dependent quantity. Therefore, the relationship between phenotype robustness and the logarithm of phenotype size consistently appears in very generic sequence-to-structure models. The relationship between phenotype robustness and evolvability cannot be derived unless an explicit rule linking possible mutations to phenotypes with different $\ell$ is introduced. In our Models 1, 2, and 3, such a rule, which could take a form analogous to the stop codon of the Fibonacci map [33], is not defined. The case of RNA is particularly interesting and has long received significant computational attention [34]. Only partial explorations of the accessibility of alternative phenotypes have been performed due to the huge sizes of phenotypes [28, 11]. Hopefully, further extensions of our Model 4 could help in the analytical treatment of this highly complex problem. Advances in empirical techniques, such as the intensive use of microarrays, should allow in the near future an exhaustive characterization of actual genotype spaces, as has been done for short transcription factor binding sites [12].
We believe that analyses of empirical GP maps will reveal strengths and weaknesses of the approach presented here, and likely suggest improvements, regarding in particular a formal description of phenotype networks (networks of genotype networks) and evolvability in natural systems.
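The robustness argument in this discussion can be made concrete with a short sketch. It computes robustness as the ratio of neutral neighbours to all neighbours, as described above, and then expresses it as a function of phenotype size by inverting a two-class size law $S(\ell)=v_{1}^{L-\ell}v_{2}^{\ell}$. That size law, the function names and the parameter values are assumptions chosen for illustration. Other models would change the base $\zeta$ of the logarithm but not the linear dependence on $\log S$.

```python
import math


def robustness(L, ell, v1, v2, k):
    """Neutral neighbours over all neighbours:
    [(v1 - 1)(L - ell) + (v2 - 1) ell] / [L (k - 1)]."""
    return ((v1 - 1) * (L - ell) + (v2 - 1) * ell) / (L * (k - 1))


def ell_from_size(L, S, v1, v2):
    """Invert the ASSUMED size law S = v1^(L - ell) * v2^ell for ell."""
    return (L * math.log(v1) - math.log(S)) / math.log(v1 / v2)


def robustness_from_size(L, S, v1, v2, k):
    """Robustness as a function of phenotype size; linear in log S."""
    return robustness(L, ell_from_size(L, S, v1, v2), v1, v2, k)


if __name__ == "__main__":
    L, v1, v2, k = 20, 4, 2, 4  # illustrative values
    for ell in (0, 5, 10, 15):
        S = v1 ** (L - ell) * v2 ** ell
        print(f"ell={ell:2d}  log2 S={math.log2(S):5.1f}  "
              f"robustness={robustness_from_size(L, S, v1, v2, k):.3f}")
```

Equal steps in $\log S$ produce equal steps in robustness, which is the logarithmic relationship stated in the text.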

Appendix A

In order to derive $S(r)$ and $p(S)$ for Model 2 with constrained sites split into two groups at the extremes of the sequence (Section 3.2.1) it will prove convenient to use the affine transformation of $\ell$

[TABLE]

Then Eq. (9) can be rewritten

[TABLE]

from which

[TABLE]

Inversion of this equation yields

[TABLE]

where $W(x)$ is Lambert’s product-logarithm function [44, Def. 4.13.1].

Now,

[TABLE]

and using Eq. (36),

[TABLE]

Finally, since $W(z)\sim\log z+O(\log\log z)$ when $z\gg 1$ [44, Prop. 4.13.10], when the rank $r$ is large

[TABLE]

with $a$ a constant. Then, for large $r$ we obtain Eq. (10).
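The large-argument behaviour of Lambert's function invoked above is easy to inspect numerically. The sketch below solves $we^{w}=z$ for $z\geqslant 0$ with Newton's method in pure Python and compares the result with $\log z$. The function name and the solver are illustrative choices, not the procedure used by the authors.

```python
import math


def lambert_w(z, tol=1e-12, max_iter=100):
    """Principal branch W(z) for z >= 0 via Newton's method on w*exp(w) = z."""
    if z < 0:
        raise ValueError("this sketch only handles z >= 0")
    # log(1 + z) is never below W(z), so Newton's method descends monotonically.
    w = math.log1p(z)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - z) / (ew * (w + 1))
        w -= step
        if abs(step) < tol * (1 + abs(w)):
            break
    return w


if __name__ == "__main__":
    for z in (1e2, 1e4, 1e8, 1e16):
        w = lambert_w(z)
        print(f"z={z:.0e}  W={w:.4f}  log z={math.log(z):.4f}  "
              f"ratio={w / math.log(z):.3f}")
```

The ratio $W(z)/\log z$ approaches 1 only slowly, because the leading correction is of order $\log\log z$, so the asymptotic forms for large rank should be read as leading-order statements.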

As for $p(S)$ , from Eqs. (7) and (8),

[TABLE]

Differentiating $\log S$ with respect to $\ell$ yields $|S^{\prime}(\ell)|=S\log k$ . Therefore, eliminating $\ell$ from this same equation we end up with

[TABLE]

with $b$ another constant. This is Eq. (11).

Appendix B

The rank function for the case of RNA sequences whose sites may take two values of neutrality $v_{1}$ and $v_{2}$, with a number $Q(L,\ell)$ of secondary structures of length $L$ with $\ell$ sites with neutrality $v_{1}$ and a total number $Q_{L}$ of different secondary structures of length $L$, is

[TABLE]

where

[TABLE]

Now, since $\ell-\mu_{L}-\xi\sigma_{L}^{2}$ will be negative for all $\mu_{L}-\sigma_{L}\lesssim\ell\lesssim\mu_{L}+\sigma_{L}$ , we can use the asymptotic expansion of the complementary error function

[TABLE]

to write

[TABLE]
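The accuracy of the large-argument expansion of the complementary error function used in this step can be gauged numerically. The sketch below compares the standard leading term, $\operatorname{erfc}(x)\simeq e^{-x^{2}}/(x\sqrt{\pi})$ for $x\gg 1$, with the library value. This is the textbook leading-order form and is assumed here. The exact expression and number of terms retained in the derivation above are not reproduced.

```python
import math


def erfc_leading(x):
    """Leading asymptotic term, erfc(x) ~ exp(-x^2) / (x * sqrt(pi)) for x >> 1."""
    return math.exp(-x * x) / (x * math.sqrt(math.pi))


def relative_error(x):
    """Relative deviation of the leading term from math.erfc(x)."""
    return abs(erfc_leading(x) / math.erfc(x) - 1)


if __name__ == "__main__":
    for x in (1.0, 2.0, 5.0, 10.0):
        print(f"x={x:4.1f}  relative error of leading term = {relative_error(x):.4f}")
```

The relative error decays roughly as $1/(2x^{2})$, so the approximation is reliable only when the argument of the error function is well above unity.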

In order to find how the size of a phenotype depends on its rank value $r(\ell)$ it is convenient to introduce new parameters. Let us denote $\mu\equiv\mu_{L}/L$ and $\sigma\equiv\sigma_{L}/\sqrt{L}$ , and

[TABLE]

with $\rho\simeq 1.85$ . The size of a phenotype is given by $S(\ell)=v_{1}^{L-2\ell}v_{2}^{2\ell}$ , therefore

[TABLE]

Now, taking logarithms in (46) and neglecting subdominant terms in $L$ ,

[TABLE]

Hence

[TABLE]

and therefore

[TABLE]

which implies

[TABLE]

Author contributions

SM and JAC designed the study, carried out the calculations, interpreted the results and wrote the manuscript. Both authors read and approved the final text.

Acknowledgements

The authors acknowledge the thorough revision of three anonymous reviewers, which has helped improve this paper.

Funding statement

This work has been supported by the Spanish Ministerio de Economía y Competitividad and FEDER funds of the EU through grants ViralESS (FIS2014-57686-P) and VARIANCE (FIS2015-64349-P).

Bibliography

[1] David J. Lipman and W. John Wilbur. Modelling neutral and selective evolution of protein folding. Proc. Roy. Soc. London B, 245(1312):7–11, 1991.
[2] Ch. Holzgräfe, A. Irbäck, and C. Troein. Mutation-induced fold switching among lattice proteins. J. Chem. Phys., 135:195101, 2011.
[3] Susanna C. Manrubia and Rafael Sanjuán. Shape matters: effect of point mutations on RNA secondary structure. Adv. Compl. Syst., 16:1250052, 2013.
[4] Matthew T. Weirauch and Timothy R. Hughes. Conserved expression without conserved regulatory sequence: the more things change, the more they stay the same. Trends Genet., 26:66–74, 2010.
[5] T. Baba, T. Ara, M. Hasegawa, Y. Takai, Y. Okumura, M. Baba, K.A. Datsenko, M. Tomita, B.L. Wanner, and H. Mori. Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol. Sys. Biol., 20:2006.0008, 2006.
[6] S. L. Rutherford. From genotype to phenotype: buffering mechanisms and the storage of genetic information. Bioessays, 22:1095–1105, 2000.
[7] J. D. Bloom, P. A. Romero, Z. Lu, and F. H. Arnold. Neutral genetic drift can alter promiscuous protein functions, potentially aiding functional evolution. Biol. Dir., 2:17, 2007.
[8] Joram Piatigorsky. Gene sharing and evolution: the diversity of protein functions. Harvard University Press, Cambridge, MA, 2007.