A mathematical model for gene evolution after whole genome duplication
Yoji Nakamura

TL;DR
This paper introduces a mathematical model to quantify gene evolution following whole genome duplication, focusing on the balance between gene loss and functional divergence in teleost fish over 90 million years.
Contribution
The study presents a novel equilibrium-based mathematical model to predict gene functional divergence and loss after WGD, applicable to teleosts and potentially other lineages.
Findings
Estimated up to 3000 gene pairs differentiated functionally in 90 million years
Model allows quantitative assessment of WGD impact on genomes
Provides a framework for comparing WGD effects across lineages
Taxonomy
Topics: Chromosomal and Genetic Variations · Genomics and Phylogenetic Studies · Genetic Mapping and Diversity in Plants and Animals
A mathematical model for gene evolution after whole genome duplication
Y. Nakamura
National Research Institute of Fisheries Science
Japan Fisheries Research and Education Agency
Yokohama, Kanagawa, Japan
Abstract
Whole genome duplication (WGD) is one of the most important events in the molecular evolution of organisms. In fish, a WGD is considered to have occurred in the ancestral lineage of teleosts. Recent comprehensive ortholog comparisons among teleost genomes have provided useful data and insights into the fate of the redundant genes generated by WGD. Based on these data, a mathematical model is proposed to explain the evolutionary scenario of genes after WGD. The model is parameterized by taking into account an equilibrium between i) rapid loss of either of the duplicate genes and ii) moderate functional differentiation of each of the duplicate genes, both of which are followed by slow gene loss under purifying selection. This model predicts that, in the teleost lineage, a maximum of about 3000 gene pairs may have differentiated functionally during the 90 million years after WGD. Thus, the present study suggests that the overall impact of WGD can be quantitatively assessed from the model parameters before the details of genomic structural changes or functional differentiation are investigated. If the equilibrium model is valid not only for teleosts but also for other lineages that have undergone WGDs, correlations between the assessment indices and evolutionarily significant events, such as the diversification of species or the occurrence of novel phenotypes, could be tested and compared among those lineages.
1 Introduction
In the field of genetics, much attention has been given to gene duplication and its significance in the evolution of organisms [1, 2, 3]. Whole genome duplication (WGD), in view of its large scale, is considered to be one of the most significant evolutionary events. A well-known example is that the ancestor of vertebrates went through at least two WGDs [2, 4] 500–800 million years ago [5]. In the fish lineage, one more WGD (teleost-specific WGD, or TS-WGD) is estimated to have occurred in the ancestor of teleosts 300–400 million years ago [5, 6, 7]. WGD generates functionally identical copies of the genes in a genome, so that either member of each duplicate pair becomes a spare, dispensable gene. It is considered that, after such an expansion of genes by WGD, many of the redundant partners become pseudogenes through degenerative mutations [3, 8] and are eventually lost in the course of evolution. At the same time, for some gene pairs, the functional redundancy may be lost by chance through mutation. In such cases, either or both of the duplicate genes acquire novel roles that differ from the original one [2, 8], and both genes thereafter evolve mainly under purifying selection. Indeed, many cases of functional differentiation in duplicate genes have been reported [1]. In addition, many theoretical studies have investigated gene duplication from the viewpoint of population genetics [9, 10, 11, 12], but studies based on large-scale data remain scarce. Recently, a quantitative comparison among teleost genes was performed at the genomic level, which provided significant implications for the evolutionary fate of duplicate genes after WGD [13]. In particular, the results showed that the loss of redundant genes generated by the TS-WGD was very rapid, with more than 70% of gene pairs becoming single within 60 million years, after which the rate of gene loss was very slow.
In the present study, I propose a simple mathematical model to explain the gene loss patterns observed after WGD. The model is parameterized by taking into account an equilibrium between two evolutionary scenarios for duplicate genes: i) rapid loss while the two copies are functionally dispensable to each other, and ii) moderate occurrence of functional differentiation between the two copies. Additionally, slow gene loss in a conservative manner, in which purifying selection is dominant, is parameterized in the model. I applied the equilibrium model to recent data derived from a comparison of teleost genes at the genomic level [13], showed that the model explains the data well, and discussed the potential of the model for assessing the impact of WGD on all the genes of organisms that have gone through WGDs.
2 Mathematical model
In the equilibrium model, three rate parameters are defined: the rate of loss of a functionally redundant partner in a duplicate gene pair; the rate of functional differentiation in a duplicate gene pair; and the rate of loss of non-redundant genes. Two types of non-redundant genes are considered: i) single genes that have lost their partners and ii) genes that have functionally differentiated in a pair. Thus, the third rate is associated with the normal process of gene evolution under purifying selection, and its value can also be computed in other studies regardless of the context of gene duplication. It should be noted that “functional differentiation” events can occur through mechanisms such as neofunctionalization, subfunctionalization, or dosage selection [14]. Such mechanisms are not distinguished in the model; all of them may be included in the rate of functional differentiation, and “functional differentiation” is instead defined as any event in which the type of natural selection acting on genes is switched (i.e., from relaxed selection to purifying selection). In the model, I assumed that the loss of duplicate genes occurs one by one; that is, the two duplicate genes are never lost at the same time. All the possible states of a gene pair and the corresponding model parameters are summarized in Figure 1.
In addition, I defined three probabilities as functions of the time elapsed since WGD: the probability that both duplicate genes in a pair remain (state = “pair”), the probability that either of the duplicate genes in a pair has been lost (state = “single”), and the probability that both duplicate genes in a pair have been lost (state = “none”). Time zero is the time point at which WGD occurred. In the state “pair,” each gene pair is in one of two states, “functionally redundant” or “functionally differentiated” (Figure 1), and hence the probability of the state “pair” is the sum of the probabilities of these two states.
In addition, these probabilities satisfy the following differential equations:
[TABLE]
where and . Solving these equations, , and are represented as follows:
[TABLE]
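For orientation, one system that is consistent with the verbal description of the states can be written down as follows. This is a sketch under assumed notation, not a transcription of the equations in the paper: the symbols \alpha (loss of a redundant partner), \beta (functional differentiation), \gamma (loss of non-redundant genes), P_r and P_d (redundant and differentiated pairs), P_1 (single), and P_0 (none) are all introduced here for illustration. It further assumes that a differentiated pair leaves the state “pair” at rate \gamma and that every pair is redundant at time zero.

```latex
\begin{align}
\frac{dP_r}{dt} &= -(\alpha+\beta)\,P_r, &
\frac{dP_d}{dt} &= \beta\,P_r - \gamma\,P_d, \\
\frac{dP_1}{dt} &= \alpha\,P_r + \gamma\,P_d - \gamma\,P_1, &
\frac{dP_0}{dt} &= \gamma\,P_1, \\
P_r(t) &= e^{-(\alpha+\beta)t}, &
P_d(t) &= \frac{\beta}{\alpha+\beta-\gamma}\left(e^{-\gamma t}-e^{-(\alpha+\beta)t}\right).
\end{align}
```

Under these assumptions the probability of the state “pair” is P_r(t) + P_d(t), and the four probabilities sum to one at all times.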
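It should be noted that, when the rate of loss of non-redundant genes is zero, the probability of the state “pair” converges to a nonzero constant as time goes to infinity, namely the rate of functional differentiation divided by the sum of this rate and the rate of loss of a redundant partner. This indicates that the loss of gene pairs and the functional differentiation of duplicate genes have reached an equilibrium state. In this study, however, the rate of loss of non-redundant genes is larger than zero; therefore, the probability of the state “pair” continues to decrease and finally converges to zero.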
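The convergence behaviour described above can be checked numerically. The following is a minimal sketch rather than the author's implementation. It assumes a four-state chain (redundant pair, differentiated pair, single, none) in which a redundant pair loses a partner at rate `alpha` or differentiates at rate `beta`, and every later loss step occurs at rate `gamma`. The function and parameter names are hypothetical, and the closed forms hold only under these assumptions. The numerical values are the uncorrected estimates quoted later in the text.

```python
import math

def state_probabilities(t, alpha, beta, gamma):
    """Return (redundant pair, differentiated pair, single, none) at time t.

    Closed forms for the assumed chain; requires alpha + beta != gamma.
    """
    a = alpha + beta
    u = a - gamma
    redundant = math.exp(-a * t)
    differentiated = beta / u * (math.exp(-gamma * t) - math.exp(-a * t))
    single = math.exp(-gamma * t) * (
        alpha / u * (1.0 - math.exp(-u * t))
        + gamma * beta / u * (t - (1.0 - math.exp(-u * t)) / u)
    )
    none = 1.0 - redundant - differentiated - single
    return redundant, differentiated, single, none

def pair_probability(t, alpha, beta, gamma):
    """Probability of the state 'pair' = redundant + differentiated."""
    redundant, differentiated, _, _ = state_probabilities(t, alpha, beta, gamma)
    return redundant + differentiated

alpha, beta = 0.044, 0.0076
# With gamma = 0 the pair probability settles at beta / (alpha + beta).
print(pair_probability(1e4, alpha, beta, 0.0), beta / (alpha + beta))
# With gamma > 0 it keeps decaying towards zero.
print(pair_probability(1e4, alpha, beta, 0.00078))
```

With the loss rate of non-redundant genes set to zero, the two printed values on the first line agree, which is the equilibrium described in the text.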
3 Results and discussion
The equilibrium model was applied to the teleost fish data published by Inoue et al. [13]. These data include information for duplicate genes in a total of 6892 pairs, which were chosen based on a comparison among teleost and outgroup genome data [15]. The genes in these pairs are orthologous among nine teleosts (Mexican tetra, zebrafish, Atlantic cod, Nile tilapia, platyfish, medaka, stickleback, greenpuffer, and fugu), and the conservation or loss of duplicate genes in each of the genomes is recorded in the original data (Figure 2A).
The three rate parameters were fitted using the equation for the probability of the state “pair,” according to the numbers of gene pairs that were estimated to have been present or are now present at 10 time points (i.e., nodes or edges in the phylogenetic tree) between the time of the TS-WGD and the present. From the original data, I also counted the numbers of gene pairs that were completely lost in each of the nine teleost genomes and averaged them (Table 1). The fitting was done by the primal-dual interior point method [16] implemented in Mathematica ver. 11 (Wolfram Research, Illinois, USA), subject to a constraint. As a result, the rates of loss of a redundant partner, functional differentiation, and loss of non-redundant genes were estimated to be 0.044, 0.0076, and 0.00078, respectively. The fitted curve matched the actual data well (Figure 3A) and was comparable to that of a recent model [13], suggesting that the equilibrium model is a viable alternative. In particular, the equilibrium model is consistent with the observation that a substantial number of gene pairs were lost, on average, in the extant teleost genomes (Table 1). In the previous model, single genes that had lost their partners were assumed to be indispensable for the teleost, so that single genes are never lost thereafter. Such an assumption seems to be inconsistent with the actual data.
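The fitting step can be illustrated with a small least-squares sketch. Several things here are assumptions and not taken from the paper: the time points and the data are synthetic (generated from the model itself), the pair probability uses an assumed closed form in which a differentiated pair leaves the state “pair” at the loss rate of non-redundant genes, no constraint is imposed, and a simple multiplicative pattern search is swapped in for the primal-dual interior point method. All names are hypothetical.

```python
import math

def pair_probability(t, alpha, beta, gamma):
    """Assumed closed form for the state 'pair' (redundant + differentiated)."""
    a = alpha + beta
    return math.exp(-a * t) + beta / (a - gamma) * (
        math.exp(-gamma * t) - math.exp(-a * t)
    )

def sum_of_squares(params, times, observed):
    """Squared error between the model curve and the observed fractions."""
    return sum(
        (pair_probability(t, *params) - y) ** 2 for t, y in zip(times, observed)
    )

def pattern_search(times, observed, start, rounds=300):
    """Multiplicative coordinate search; accepts only moves that lower the error."""
    params, step = list(start), 0.5
    best = sum_of_squares(params, times, observed)
    for _ in range(rounds):
        improved = False
        for i in range(len(params)):
            for factor in (1.0 + step, 1.0 / (1.0 + step)):
                trial = list(params)
                trial[i] *= factor
                score = sum_of_squares(trial, times, observed)
                if score < best:
                    params, best, improved = trial, score, True
        if not improved:
            step *= 0.5  # shrink the step once no single move helps
    return params, best

# Synthetic, noise-free example with 10 hypothetical time points (million years).
times = [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
truth = (0.044, 0.0076, 0.00078)
observed = [pair_probability(t, *truth) for t in times]
start = (0.02, 0.01, 0.001)
initial_error = sum_of_squares(start, times, observed)
fitted, final_error = pattern_search(times, observed, start)
print(fitted, initial_error, final_error)
```

Because the search is multiplicative, all rates stay positive without an explicit bound. Real data would replace `observed` with the observed pair counts divided by the total number of pairs.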
Table 1. Number of gene pairs lost in extant teleosts.
Teleost examined Number of lost gene pairs
Mexican tetra
Zebrafish
Atlantic cod
Nile tilapia
Platyfish
Medaka
Stickleback
Greenpuffer
Fugu
Average ± SD
SD, standard deviation.
The three estimated rate parameters were further corrected to account for a feature of the original data, namely that the data were composed of genes that are still present in at least one of the nine teleost genomes. For example, gene pairs that were completely lost before the first divergence among the examined teleosts are never transmitted to the teleost genomes examined (Figure 2A); therefore, these pairs cannot have been counted in the original data. In addition, parallel gene losses in descending sister lineages after the divergence at a node will make the state “pair” at that node untraceable, resulting in the underestimation of gene pairs (Figure 2B). First, the true number of gene pairs present at the time of the TS-WGD was introduced as an additional parameter. For the extant gene pairs in the original data, the equation is
[TABLE]
Next, two conditional probabilities were defined for a gene pair that is in the state “pair” at a given time: i) the probability that either of the duplicate genes will have been lost by a later time; and ii) the probability that both duplicate genes in the pair will have been lost by that later time. These conditional probabilities are given by
[TABLE]
where . Using these probabilities, the ratio of gene pairs that will be underestimated by parallel gene losses between a node of time and two descending sister nodes or edges and (times, and ) is given by
[TABLE]
where and . Note that there are seven patterns of parallel gene loss causing the underestimation of gene pairs in the state “pair” (Figure 2B), which correspond to one of , two of , two of , and two of . In the case of the last nodes (, and ), which are followed by edges, the ratio of underestimation by parallel gene loss is equal to . Therefore, the numbers of gene pairs counted in these nodes were corrected to . Contrastingly, in the case of the deeper nodes ( and ), which are followed by at least one node, the patterns of parallel gene loss are much more complicated. For the deeper nodes, I first performed the simulation of gene loss with uncorrected parameters, then computed the ratio of underestimation by parallel gene losses (Table 2). The results showed that, as in the case of the last nodes, the ratio of gene pairs that will be underestimated by parallel gene losses could be roughly approximated by . Note that the approximation may depend on the number of species examined or the phylogenetic relationship. When other data sets are used, the method of parameter correction may have to be modified. In the present study, the ratio of underestimation at the deepest node () was different by 5% from , probably because the original data were sparse around this node (Figure 3A). Letting the ratio of underestimation of gene pairs at be , the formula is apparently established, where is the ratio of underestimation focusing on only two distantly related species (e.g., zebrafish and medaka). Therefore, was approximated by in this study. Finally, for the period , the losses of gene pairs until the next two nodes ( and ,) were taken into account, and the proportion of gene pairs to be unobservable in the original data was approximated by
[TABLE]
The value of was about 0.031 with the uncorrected parameters, close to the simulated estimate (Table 2). Thus, the following equations were obtained:
[TABLE]
According to these equations, the four parameters were fitted again by the Newton-Raphson method, with initial values of 6892 for the true number of gene pairs and 0.044, 0.0076, and 0.00078 for the rates of loss of a redundant partner, functional differentiation, and loss of non-redundant genes, respectively. As a result, the true number of gene pairs was estimated to be 7143, and the three rates were corrected to 0.036, 0.0062, and 0.00095, respectively. The rates thus increased or decreased by about 20%, and the total number of gene pairs was 251 more than the original number. The model curves with the corrected parameters are shown in Figure 3. Little difference was observed in the shape of the curves obtained with the corrected and uncorrected parameters. The number of lost gene pairs to be counted was corrected to 1476 (Figure 3B), but the number observable in the extant genomes was estimated to be 1209 according to the above equation, which was close to the number in the original data.
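The size of the correction can be verified with a few lines of arithmetic. The numbers are those quoted in the paragraph above; the dictionary keys are hypothetical labels chosen for this sketch.

```python
# Rates before and after the correction, as quoted in the text.
uncorrected = {
    "redundant_loss": 0.044,
    "differentiation": 0.0076,
    "nonredundant_loss": 0.00078,
}
corrected = {
    "redundant_loss": 0.036,
    "differentiation": 0.0062,
    "nonredundant_loss": 0.00095,
}

# Relative change of each rate caused by the correction.
relative_change = {
    name: (corrected[name] - uncorrected[name]) / uncorrected[name]
    for name in uncorrected
}
# Difference between the corrected and original numbers of gene pairs.
extra_pairs = 7143 - 6892

for name, change in relative_change.items():
    print(f"{name}: {change:+.1%}")
print("additional gene pairs:", extra_pairs)
```

The first two rates fall by roughly 18% and the third rises by roughly 22%, in line with the "about 20%" stated above.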
Table 2. Underestimation of gene pairs in the state “pair” by parallel gene losses.
Time point of node () Ratio of underestimation (SD)
Time of TS-WGD Ratio of underestimation (SD)
Simulations were performed 1000 times.
SD, standard deviation.
TS-WGD, teleost-specific whole genome duplication.
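The seven patterns of parallel gene loss mentioned in the text can be enumerated explicitly. The sketch below is an illustration under stated assumptions, not the author's simulation: each of two sister lineages starts in the state “pair”, the two copies (labelled `a` and `b` here) are equally likely to be the one lost, the lineages evolve independently, and a pair is treated as untraceable when fewer than two distinct copies survive across both lineages. The conditional probabilities passed in are arbitrary example values, and all names are hypothetical.

```python
from itertools import product

def lineage_outcomes(q_single, q_none):
    """Outcome probabilities for one lineage that started in the state 'pair'."""
    return {
        frozenset("ab"): 1.0 - q_single - q_none,  # both copies retained
        frozenset("a"): q_single / 2.0,            # copy b lost
        frozenset("b"): q_single / 2.0,            # copy a lost
        frozenset(): q_none,                       # both copies lost
    }

def underestimation(q_single_1, q_none_1, q_single_2, q_none_2):
    """Return (ratio, patterns) for pairs made untraceable by parallel loss."""
    first = lineage_outcomes(q_single_1, q_none_1)
    second = lineage_outcomes(q_single_2, q_none_2)
    ratio, patterns = 0.0, []
    for (kept_1, p_1), (kept_2, p_2) in product(first.items(), second.items()):
        # The ancestral pair is recoverable only if both copies are seen somewhere.
        if len(kept_1 | kept_2) < 2:
            ratio += p_1 * p_2
            patterns.append((sorted(kept_1), sorted(kept_2)))
    return ratio, patterns

ratio, patterns = underestimation(0.2, 0.05, 0.3, 0.1)
print(len(patterns), ratio)
```

Under these assumptions the enumeration yields one pattern in which both lineages lose both copies, two in which both lineages lose the same copy, and four in which one lineage loses a single copy while the other loses both.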
In this equilibrium model, the probability of the state “none” converges to 1 as time goes to infinity, indicating that, theoretically, all the genes will be lost in the long-term future. Such a prediction seems unnatural from the viewpoint of genome evolution. However, it should be noted that the initial number of gene pairs was fixed in the modeling (7143, or originally 6892), and gene gain events were not taken into account. It is possible that the number of genes gained after WGD may compensate for the number lost. In addition, the rate of loss of non-redundant genes is very small; therefore, according to the model, it will take about 700 million years from the present for the number of genes to decrease by half (to 3571.5). This is a long enough time for the gene content to be influenced by many other evolutionary mechanisms; thus, ultra-long-term predictions (about 1000 million years after WGD) by the equilibrium model are not practical. Rather, the equilibrium model is better suited to estimating the evolutionary features of duplicate genes at present or in the past. Here, I focused on the probability that a pair of genes derived from WGD is functionally differentiated at a given time (Figure 1). The ratio of this probability to the probability that a gene pair remains, evaluated at the present, was almost 1, implying that almost all the gene pairs present in the extant teleosts have already differentiated functionally. This conjecture is consistent with the result of a recent gene expression analysis [17], although that study was based on a limited number of gene pairs examined only in zebrafish. Further research using many genes and/or many teleost genomes will be needed to test the equilibrium model.
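Both long-term statements above can be explored with a short calculation. This is a sketch under assumptions, not the author's computation: the closed forms belong to an assumed four-state chain in which every loss of a non-redundant gene occurs at the same rate; "number of genes" is read as the expected number of surviving copies per original pair (two for a pair, one for a single), so that half of 7143 corresponds to 0.5 copies per pair; and an elapsed time of 300 million years stands in for the present, chosen from the 300–400 million year range given in the Introduction. All names are hypothetical.

```python
import math

ALPHA, BETA, GAMMA = 0.036, 0.0062, 0.00095  # corrected rates quoted in the text

def state_probabilities(t, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (redundant pair, differentiated pair, single) at time t."""
    a = alpha + beta
    u = a - gamma  # assumed nonzero
    redundant = math.exp(-a * t)
    differentiated = beta / u * (math.exp(-gamma * t) - math.exp(-a * t))
    single = math.exp(-gamma * t) * (
        alpha / u * (1.0 - math.exp(-u * t))
        + gamma * beta / u * (t - (1.0 - math.exp(-u * t)) / u)
    )
    return redundant, differentiated, single

def copies_per_pair(t):
    """Expected surviving gene copies per original duplicate pair."""
    redundant, differentiated, single = state_probabilities(t)
    return 2.0 * (redundant + differentiated) + single

def time_to_reach(target, lo=0.0, hi=5000.0):
    """Bisection on the monotonically decreasing copies_per_pair curve."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if copies_per_pair(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

half_time = time_to_reach(0.5)
redundant, differentiated, _ = state_probabilities(300.0)  # assumed present
differentiated_share = differentiated / (redundant + differentiated)
print(half_time, differentiated_share)
```

Under these assumptions the halving time falls roughly 1000 million years after the WGD, and nearly all surviving pairs are in the differentiated state, which matches the qualitative statements in the text.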
The advantage of the equilibrium model is that the impact of WGD can be quantitatively assessed using the model parameters. For example, the rate of functional differentiation in a gene pair may be correlated with the occurrence of novel genotypes triggered by WGD. Many detailed models of the mechanisms of functional differentiation have been proposed (reviewed in [18]), and this rate may be regarded as an averaged index of the effects of these mechanisms at the genomic level. In addition, the ratio of the rate of functional differentiation to the sum of this rate and the rate of loss of a redundant partner indicates the ideal proportion of functionally differentiated gene pairs out of all gene pairs when the rate of loss of non-redundant genes is zero and time goes to infinity. Practically, the probability that a gene pair is functionally differentiated at a given time better reflects the proportion,
[TABLE]
and the maximum is obtained by setting its time derivative to zero. For the teleost data used in the present study, the maximum was about 0.13, reached about 90 million years after the TS-WGD (Figure 4).
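The location of the maximum can be worked out in closed form under one plausible reading of the model. The derivation below is a sketch with assumed notation: \alpha, \beta, and \gamma denote the rates of loss of a redundant partner, functional differentiation, and loss of non-redundant genes, P_d(t) is the probability of a functionally differentiated pair, and a differentiated pair is assumed to leave the state “pair” at rate \gamma.

```latex
\begin{align}
P_d(t) &= \frac{\beta}{\alpha+\beta-\gamma}\left(e^{-\gamma t}-e^{-(\alpha+\beta)t}\right), \\
\frac{dP_d}{dt} &= \frac{\beta}{\alpha+\beta-\gamma}\left((\alpha+\beta)\,e^{-(\alpha+\beta)t}-\gamma\,e^{-\gamma t}\right) = 0, \\
t^{*} &= \frac{\ln\!\left((\alpha+\beta)/\gamma\right)}{\alpha+\beta-\gamma}.
\end{align}
```

With the corrected rates 0.036, 0.0062, and 0.00095, this gives a peak time of about 92 million years and a peak value of about 0.13, close to the figures reported in the text.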
Thus, assuming that the total gene number for a standard teleost species is 20000–25000 [19, 20], a maximum of 2600–3300 gene pairs were estimated to have differentiated functionally during the 90 million years after the TS-WGD. In the case of plant species, it was reported that 99% of about 2000 duplicate gene pairs that were examined in the cotton genome had differentiated at the gene expression level during the 60 million years after WGD, and that these have probably evolved under purifying selection [21]. Therefore, the estimate of functionally differentiated gene pairs in teleost species might not be very surprising. It should be stressed that if the equilibrium model is valid not only for teleosts but also for other lineages that have undergone WGDs, the above-mentioned indices could be compared among such lineages. From a naïve perspective, the rate of functional differentiation or the proportions of functionally differentiated gene pairs described above may be directly or indirectly associated with evolutionarily significant events in the lineages examined, such as the occurrence of novel phenotypes, or with other evolutionary features such as the diversity of species or population size. Regarding the teleosts, it is known that one more WGD occurred recently (100 million years ago) in the lineage of salmonids after the TS-WGD (reviewed in [22]). Therefore, further comparisons using salmonid genomic data may allow the impact of WGD to be assessed and compared within the teleost lineage.
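The scaling from a peak proportion to a number of gene pairs is simple arithmetic and can be sketched as follows. The closed form for the differentiated-pair probability is an assumption of this sketch (a differentiated pair is taken to leave the state “pair” at the loss rate of non-redundant genes), and the names are hypothetical. The rates are the corrected values quoted in the text, and the peak is rounded to two decimals before scaling, as the reported 0.13 suggests.

```python
import math

ALPHA, BETA, GAMMA = 0.036, 0.0062, 0.00095  # corrected rates quoted in the text

def differentiated_probability(t):
    """Assumed closed form for a functionally differentiated pair at time t."""
    a = ALPHA + BETA
    return BETA / (a - GAMMA) * (math.exp(-GAMMA * t) - math.exp(-a * t))

def peak_time():
    """Time at which the derivative of the assumed closed form vanishes."""
    a = ALPHA + BETA
    return math.log(a / GAMMA) / (a - GAMMA)

t_peak = peak_time()
peak = differentiated_probability(t_peak)
rounded_peak = round(peak, 2)

# Scale by the assumed total gene numbers for a standard teleost species.
estimates = {total: round(rounded_peak * total) for total in (20000, 25000)}
print(t_peak, peak, estimates)
```

The resulting range lies within the 2600–3300 gene pairs stated above; the upper end differs slightly only through rounding.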
References
- [1] Graur D, Li W-H. (2000). Fundamentals of Molecular Evolution. 2nd ed. Sunderland, MA: Sinauer.
- [2] Ohno S. (1970). Evolution by Gene Duplication. Berlin: Springer-Verlag.
- [3] Haldane JBS. (1933). The part played by recurrent mutation in evolution. The American Naturalist 67(708):5-19.
- [4] Holland PW, Garcia-Fernandez J, Williams NA, Sidow A. (1994). Gene duplications and the origins of vertebrate development. Development Supplement:125-33.
- [5] Vandepoele K, De Vos W, Taylor JS, Meyer A, Van de Peer Y. (2004). Major events in the genome evolution of vertebrates: paranome age and size differ considerably between ray-finned fishes and land vertebrates. Proc Natl Acad Sci USA 101(6):1638-43.
- [6] Taylor JS, Van de Peer Y, Braasch I, Meyer A. (2001). Comparative genomics provides evidence for an ancient genome duplication event in fish. Philos Trans R Soc Lond B Biol Sci 356(1414):1661-79.
- [7] Wittbrodt J, Meyer A, Schartl M. (1998). More genes in fish? BioEssays 20(6):511-5.
- [8] Lynch M, Conery JS. (2000). The evolutionary fate and consequences of duplicate genes. Science 290(5494):1151-5.
