Counting and sampling gene family evolutionary histories in the duplication-loss and duplication-loss-transfer models

Cedric Chauve; Yann Ponty; Michael Wallner

arXiv:1905.04971·math.CO·May 14, 2019

TL;DR

This paper introduces formal methods to count and generate gene family evolutionary histories considering duplication, loss, and transfer, revealing exponential growth and the impact of horizontal gene transfer.

Contribution

It develops grammars and algorithms to analyze the space of gene family histories under the DLT-model, including asymptotic counts and random generation methods.

Findings

01 Including horizontal gene transfer greatly increases the number of histories.

02 Number of histories is nearly independent of species tree topology within ranked trees.

03 Exact asymptotics are obtained for specific species tree shapes.

Abstract

Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including - but not limited to - speciation, gene duplication, gene loss, horizontal gene transfer. The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the DLT-model). We introduce the notion of evolutionary histories defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the DLT-model. We…

Tables

Table 1: Leading terms for the time ($\Phi(n,k)$) and space ($\Psi(n,k)$) complexities incurred by the evaluation of the counting recurrences for histories consisting of $n$ genes in a species tree of size $k$.

Counting Time $\Phi(n,k)$	$\mathbb{D}\mathbb{L}$	$\mathbb{D}\mathbb{L}\mathbb{T}$
Unranked	$k\,n^{2}$	$k^{2}\,n^{2}$
Ranked	$k^{2}\,n^{2}$	$k^{3}\,n^{2}$

Counting Space $\Psi(n,k)$	$\mathbb{D}\mathbb{L}$	$\mathbb{D}\mathbb{L}\mathbb{T}$
Unranked	$k\,n^{2}$	$k^{2}\,n^{2}$
Ranked	$k^{2}\,n^{2}$	$k^{3}\,n^{2}$

Table 2: Leading constants and exponential growth factors for the number of $\mathbb{D}\mathbb{L}$-histories consistent with the unranked caterpillar and complete species trees. Their closed forms are given in Propositions 1–4.

$H_{k}^{\mathbf{CT}}$	$= D_{k}^{\mathbf{CT}} + S_{k}^{\mathbf{CT}}$	if $k > 1$	(19)
$H_{1}^{\mathbf{CT}}$	$= \mathcal{Z} + D_{1}^{\mathbf{CT}}$		(20)
$S_{k}^{\mathbf{CT}}$	$= H_{k - 1}^{\mathbf{CT}} \times H_{0}^{\mathbf{CT}} + H_{k - 1}^{\mathbf{CT}} + H_{0}^{\mathbf{CT}}$	if $k > 1$	(21)
$D_{k}^{\mathbf{CT}}$	$= H_{k}^{\mathbf{CT}} \times H_{k}^{\mathbf{CT}}$		(22)

Table 3: $\mathbb{D}\mathbb{L}$-history counting sequences of the caterpillar species trees $\mathbf{CT}_{k}$.

$k$	Sequence	OEIS
$1$	$1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, \dots$	A000108
$2$	$2, 7, 34, 200, 1318, 9354, 69864, 541323, 4310950, 35066384, \dots$	A307696
$3$	$3, 19, 159, 1565, 17022, 197928, 2413494, 30490089, 395828145, \dots$	A307697
$4$	$4, 39, 495, 7235, 115303, 1948791, 34379505, 626684162, \dots$	A307698
$5$	$5, 69, 1230, 24843, 541315, 12426996, 296546600, 7292489761, \dots$	A307700

Table 4: $\mathbb{D}\mathbb{L}$-history counting sequences of the complete species trees $\mathbf{CB}_{h}$ with $k=2^{h}$ leaves.

$h$	$k$	Sequence	OEIS
$0$	$1$	$1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, \dots$	A000108
$1$	$2$	$2, 7, 34, 200, 1318, 9354, 69864, 541323, 4310950, 35066384, \dots$	A307696
$2$	$4$	$4, 34, 368, 4685, 66416, 1013268, 16279788, 271594611, 4660794200, \dots$	A307941
$3$	$8$	$8, 148, 3376, 89390, 2624872, 82866636, 2755019736, 95135709027, \dots$	A307942
$4$	$16$	$16, 616, 28832, 1556780, 93017264, 5971377672, 403667945712, \dots$	A307943

Equations

$H_{u}$

$S_{u}$

$D_{u}$

$T_{u}$

$H[u,n]$

$S[u,n]$

$D[u,n]$

$T[u,n]$

Generation Time $\Psi(n,k)$	$\mathbb{D}\mathbb{L}$	$\mathbb{D}\mathbb{L}\mathbb{T}$
Unranked	$n\log n$	$k\,n\log n$
Ranked	$n\log n$	$k\,n\log n$

$\gamma_{S}\,\frac{\rho_{S}^{-n}}{n^{3/2}}\left(1+O\left(\frac{1}{n}\right)\right),$

$B = Z \cup B^{2}.$

$B(z) = z + B(z)^{2}.$

$b_{n}$

$H_{u}(z) = \sum_{n \geq 0} h_{u,n}\, z^{n}.$

$H_{u}(z) = \begin{cases} B\left(H_{u_{\ell}}(z)\, H_{u_{r}}(z) + H_{u_{\ell}}(z) + H_{u_{r}}(z)\right) & \text{if } u \text{ is internal}, \\ B(z) & \text{if } u \text{ is a leaf}, \end{cases}$

$B(z) = \frac{1-\sqrt{1-4z}}{2}.$

$H_{u}(z) = \begin{cases} H_{u}(z)^{2} + H_{u_{\ell}}(z)\, H_{u_{r}}(z) + H_{u_{\ell}}(z) + H_{u_{r}}(z) & \text{if } u \text{ is internal}, \\ H_{u}(z)^{2} + z & \text{if } u \text{ is a leaf}. \end{cases}$

$H_{u}(z) = \frac{1-\sqrt{R_{u}(z)}}{2}.$

$R_{u}(z) = \begin{cases} -4 + 3\sqrt{R_{u_{\ell}}(z)} + 3\sqrt{R_{u_{r}}(z)} - \sqrt{R_{u_{\ell}}(z)\, R_{u_{r}}(z)} & \text{if } u \text{ is internal}, \\ 1-4z & \text{if } u \text{ is a leaf}. \end{cases}$

$R_{u}(0) = -4 + 3\sqrt{R_{v}(0)} + 3\sqrt{R_{w}(0)} - \sqrt{R_{v}(0)\, R_{w}(0)} = 1.$

$R_{u}(z) = 1 - \sum_{n \geq 1} a_{n} z^{n},$

$R_{u}(\rho_{v}) = -4 + 3\sqrt{R_{w}(\rho_{v})} < 0.$

$B_{i} = \Phi_{i}(z, B_{1}, \dots, B_{k}),$

$\left\{\begin{array}{rl} b_{1} &= \Phi_{1}(\rho, b_{1}, \ldots, b_{k}) \\ &\quad\vdots \\ b_{k} &= \Phi_{k}(\rho, b_{1}, \ldots, b_{k}) \\ 0 &= \det\left(\delta_{i,j} - \frac{\partial}{\partial b_{j}} \Phi_{i}(\rho, b_{1}, \ldots, b_{k})\right), \end{array}\right.$

$\left\{\begin{array}{rl} B_{1} &= \Phi_{1}(z, B_{1}) \\ B_{2} &= \Phi_{2}(z, B_{1}, B_{2}) \\ &\quad\vdots \\ B_{k} &= \Phi_{k}(z, B_{1}, \ldots, B_{k}) \end{array}\right.$

$\left\{\begin{array}{rl} B_{1} &= \Phi_{1}(z, B_{1}, B_{2}, \ldots, B_{k-1}) \\ &\quad\vdots \\ B_{k-1} &= \Phi_{k-1}(z, B_{1}, \ldots, B_{k-1}) \\ B_{k} &= \Phi_{k}(z, B_{1}, \ldots, B_{k}) \end{array}\right.$

$1$

$H_{u}(z) = \frac{1}{2} - \sum_{i \geq 0} \gamma_{u,i} \left(1 - z/\rho_{u}\right)^{i+1/2}.$

$F_{k}(z) = \sum_{n \geq 0} f_{k,n}\, z^{n},$

$F_{k}(z) = B\left(F_{k-1}^{2}(z) + F_{k-1}(z) + B(z)\right).$

$f_{k,n} = \alpha_{k}\,\frac{\lambda_{k}^{-n}}{n^{3/2}}\left(1+O\left(\frac{1}{n}\right)\right).$

$s_{k}(X) = \begin{cases} 0 & \text{for } k = 1, \\ \frac{a(X) - s_{k-1}(X)^{2}}{b(X)} & \text{for } k > 1. \end{cases}$

Code & Models

Repositories

cchauve/DLTcount
Official

Full text

Cedric Chauve: Department of Mathematics, Simon Fraser University, Burnaby (BC), Canada; LaBRI, Université de Bordeaux, Talence, France; LIX, Ecole Polytechnique, Palaiseau, France. ORCID: 0000-0001-9837-1878

Yann Ponty: CNRS and LIX, Ecole Polytechnique, Palaiseau, France. ORCID: 0000-0002-7615-3930

Michael Wallner: LaBRI, Université de Bordeaux, Talence, France; Institut für Diskrete Mathematik und Geometrie, TU Wien, Vienna, Austria. ORCID: 0000-0001-8581-449X

Abstract

Given a set of species whose evolution is represented by a species tree, a gene family is a group of genes having evolved from a single ancestral gene. A gene family evolves along the branches of a species tree through various mechanisms, including – but not limited to – speciation ( $\mathbb{S}$ ), gene duplication ( $\mathbb{D}$ ), gene loss ( $\mathbb{L}$ ), horizontal gene transfer ( $\mathbb{T}$ ). The reconstruction of a gene tree representing the evolution of a gene family constrained by a species tree is an important problem in phylogenomics. However, unlike in the multispecies coalescent evolutionary model that considers only speciation and incomplete lineage sorting events, very little is known about the search space for gene family histories accounting for gene duplication, gene loss and horizontal gene transfer (the $\mathbb{D}\mathbb{L}\mathbb{T}$ -model).

In this work, we introduce the notion of an evolutionary history, defined as a binary ordered rooted tree describing the evolution of a gene family, constrained by a species tree in the $\mathbb{D}\mathbb{L}\mathbb{T}$-model. We provide formal grammars describing the set of all evolutionary histories that are compatible with a given species tree, whether it is ranked or unranked. These grammars allow us, using either analytic combinatorics or dynamic programming, to efficiently compute the number of histories of a given size, and also to generate random histories of a given size under the uniform distribution. We apply these tools to obtain exact asymptotics for the number of gene family histories for two species trees, the rooted caterpillar and the complete binary tree, as well as estimates of the range of the exponential growth factor of the number of histories for random species trees of size up to $25$. Our results show that including horizontal gene transfer induces a dramatic increase in the number of evolutionary histories. We also show that, within ranked species trees, the number of evolutionary histories in the $\mathbb{D}\mathbb{L}\mathbb{T}$-model is almost independent of the species tree topology. These results establish firm foundations for the development of ensemble methods for the prediction of reconciliations.

**Mathematics Subject Classification (2000)** 92B99, 05A15, 05A16

Keywords: Phylogenetics, Enumerative Combinatorics, Asymptotics, Sampling Algorithms

1 Introduction

A gene tree represents the evolution of a gene family, a group of genes assumed to descend from a single ancestral gene. The reconstruction of gene trees from molecular sequence data is a central but difficult problem in computational biology. Indeed, while species are mostly expected to evolve through speciation, gene families evolve through a wider variety of mechanisms including gene duplication, gene loss, horizontal gene transfer (HGT) and incomplete lineage sorting (ILS). As a result, it is common to observe an incongruence between gene trees and species trees sysbio/Maddison97 . This discrepancy has motivated intense research activity on the problem of reconstructing the gene tree of a gene family, conditional to a given species tree for the considered species. We refer to mmb/SzollosiD12 ; sysbio/SzollosiTDB15 for extensive reviews that discuss how gene trees evolve within a species tree and describe existing models and methods for reconstructing gene trees within species trees.

In the case where a gene family contains a single gene per species, observed incongruences between a gene tree and a species tree can be analyzed through the prism of ILS in the multispecies coalescent model tree/DegnanR09 . The natural question is then to compute the probability of coalescent histories conditional to the given species tree evolution/DegnanS05 ; evolution/Wu12 ; bioinformatics/Wu16 ; bioinformatics/Wu17 . For gene families that might contain duplicate copies (or no copy) of a gene in a given species, the multispecies coalescent model is not appropriate, and gene trees need to be inferred in a model including gene duplication, gene loss and, ideally, transfers. Most methods developed to understand the evolution of gene families in this context rely on the concept of gene tree-species tree reconciliation, illustrated in Fig. 1. In this framework, given a gene tree $G$ and a species tree $S$ , one aims to embed $G$ within $S$ , often optimizing a parsimony or probabilistic criterion with regard to the considered evolutionary model.

Early reconciliation methods were developed for an evolutionary model considering only gene duplications and gene losses (the $\mathbb{D}\mathbb{L}$-model), and considered a parsimony criterion. This problem, introduced by Goodman et al. syszoo/GoodmanCMRM79 , is computationally tractable through dynamic programming. Extending the model to include HGT, while ensuring that HGT events are time-consistent, makes the problem of predicting the most parsimonious reconciliation intractable in general cmb/OvadiaFCL11 ; tcbb/TofighHL11 . However, if the provided species tree is ranked, i.e. is provided with a total ordering of its internal nodes describing the order of speciation events, the reconciliation problem becomes tractable (see the discussion in bib/DoyonRDB11 ). Over the last 20 years, various efficient dynamic programming algorithms were designed to compute a parsimonious reconciliation, and implemented in widely used phylogenomics packages cmb/DurandHV06 ; bioinformatics/BansalKKK18 ; bioinformatics/ScornavaccaJS15 ; bioinformatics/JacoxCSPS16 . Similar to parsimony-based methods, probabilistic reconciliation methods were first developed in a model considering only gene duplication and gene loss jacm/ArvestadLS09 ; pnas/AkerborgSAL09 ; bmcbi/GoreckiBE11 ; cmb/GoreckiE14 , before being extended to include HGTs sysbio/SzollosiRBTD13 ; sysbio/SjostrandTDASL14 .

Most methods that reconstruct a gene tree, conditional to a species tree, rely on the exploration of the space of possible evolutionary histories. It is then important to develop conceptual tools that can describe this combinatorial space and further enable its efficient exploration. This naturally raises two questions: how to compute the size of the space of evolutionary histories for a given gene family and a given species tree, and how to sample such histories. Both questions are naturally related, as precise counting results often translate into efficient sampling algorithms wilf1977unified ; flajolet1994calculus . The former (counting) question has been studied by Rosenberg et al. in the case of the multispecies coalescent model jcb/Rosenberg07 ; jcb/DisantoR15 ; tcbb/DisantoR16 ; jcb/DisantoR17 ; bmb/DisantoR17 ; jmb/DisantoR18 . However, similar questions have not been explored as thoroughly for evolutionary models including gene duplication, gene loss and HGT. In this framework, dynamic programming equations aimed at computing a parsimonious reconciled gene tree can be turned into a specification of the corresponding search space tcs/GoreckiT06 ; jmb/RanwezSDB16 . This then leads to efficient algorithms for counting or sampling parsimonious reconciliations cmb/DoyonCH09 ; cmb/BansalAK13 or sampling reconciled gene trees under the Boltzmann probability distribution bioinformatics/JacoxCSPS16 . However, to the best of our knowledge, such questions have not been considered in the case where a gene tree is not specified at first, i.e. we are only given a species tree and a gene family.

This paper provides analytic and algorithmic answers to those questions. We show that, for a given species tree, whether ranked or unranked, the space of all possible evolutionary histories of a fixed size in the $\mathbb{D}\mathbb{L}\mathbb{T}$-model can be described using a formal grammar. This allows us to compute, in polynomial time and space, for a given species tree and gene family size, the number of evolutionary histories of this size conditional to the given species tree, as well as to sample among these histories under the uniform probability. Using these algorithms, we can provide estimates of the exponential growth factor of the number of histories in the $\mathbb{D}\mathbb{L}$-model and $\mathbb{D}\mathbb{L}\mathbb{T}$-model. We show that, as expected, including HGT in a model results in an exponential increase of the number of histories. We also notice that, with a ranked species tree, the exponential growth factor of the number of histories in the $\mathbb{D}\mathbb{L}\mathbb{T}$-model seems to be almost independent of the chosen species tree. Finally, using enumerative and analytic combinatorics, we provide exact values for the asymptotic number of histories for two specific species trees: the rooted caterpillar tree and the rooted complete binary tree.

2 Model: gene families evolutionary histories

In this section, we introduce the combinatorial objects modeling the evolution of a gene family within a given species tree, which we call histories.

Preliminaries on trees.

For a given rooted tree $\mathbf{T}$ (in the present work we consider only rooted trees), we say it is uniquely labeled if every node has a label, and no two nodes have the same label. For a node $x$ in $\mathbf{T}$, we denote by $\mathbf{T}_{x}$ the subtree of $\mathbf{T}$ rooted at $x$. In this work, we consider only binary and unary-binary trees: in a binary tree, every internal node has exactly two children, while in a unary-binary tree, an internal node can have either one child or two children. If a uniquely labeled tree $\mathbf{T}$ is unordered, we take advantage of the node labeling to see it as an ordered tree, with the two children of an internal node $x$ being ordered from left to right in increasing order of their labels; so from now on all trees we consider are ordered. If an internal node $x$ of a tree $\mathbf{T}$ is binary, we denote by $x_{\ell}$ the left child of $x$ and by $x_{r}$ its right child; if $x$ is unary, i.e. has a single child, we denote it by $x_{c}$. We denote by $r(\mathbf{T})$ the root of $\mathbf{T}$. For a node $x$ of $\mathbf{T}$, we denote by $p(x)$ its parent in $\mathbf{T}$. The size of a tree $\mathbf{T}$ is the number of its leaves.

A rooted tree describes a partial order on the set of its nodes, and two nodes are said to be comparable if one is an ancestor of the other one and incomparable otherwise. For a node $u$ , we denote by $\overline{C}(u)$ the set of nodes that are incomparable with $u$ .

Ranked trees.

A ranking of a tree $\mathbf{T}$ of size $n$ is a mapping $\pi$ from the nodes of $\mathbf{T}$ to $\{1,\dots,n\}$ such that (1) $\pi(x)=n$ if $x$ is a leaf, (2) $\pi(x)\neq\pi(y)$ if $x$ and $y$ are distinct internal nodes, and (3) $\pi(x)<\pi(y)$ if $x$ is an ancestor of $y$. A tree augmented with a ranking is called a ranked tree; in our context it models the evolution of a set of species, the ranking providing the relative order of speciation events, under the assumption that no two speciations can occur at the same time.

Given a binary tree $\mathbf{T}$ and a ranking $\pi$ , we define an unranked unary-binary tree $\mathbf{T}_{\pi}$ that encodes the ranking information as follows: for each internal node $u$ , considered iteratively in increasing ranking order, and for every edge $(p(v),v)$ such that $\pi(p(v))<\pi(u)<\pi(v)$ , we subdivide the edge $(p(v),v)$ into two edges $(p(v),v_{u})$ and $(v_{u},v)$ , so adding a unary node $v_{u}$ on this edge. We denote by $t(u)$ the set of all unary nodes created in this way and we call this set of nodes together with $u$ a time slice. Additionally, we also define the set of all leaves as a time slice (see Figure 2). Note that in this way we create $n$ different time slices which correspond to the $n$ different values of the ranking. We modify the notion of incomparability for such unary-binary trees as follows: for a node $u$ , $\overline{C}(u)=t(u)\setminus\{u\}$ .
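The time-slice construction can be illustrated with a short Python sketch. The dictionary-based tree encoding, the function name `time_slices` and the `v@u` naming of the unary node $v_{u}$ are assumptions made for this example only; they are not notation from the text.

```python
def time_slices(children, rank):
    """Group the nodes of the unary-binary tree T_pi by time slice.

    children: dict mapping each internal node to its (left, right) children.
    rank:     dict mapping every node to pi(node); all leaves share rank n.
    Returns a dict: rank of an internal node u -> [u] + unary nodes created for u.
    """
    parent = {c: u for u, kids in children.items() for c in kids}
    slices = {}
    for u in children:  # every internal node u opens one time slice
        # The edge (p(v), v) is subdivided when pi(p(v)) < pi(u) < pi(v).
        crossing = sorted(v for v, p in parent.items()
                          if rank[p] < rank[u] < rank[v])
        # "v@u" is a hypothetical name for the unary node v_u.
        slices[rank[u]] = [u] + [f"{v}@{u}" for v in crossing]
    return slices


# Ranked tree ((a,b),(c,d)): root r (rank 1), x=(a,b) (rank 2), y=(c,d) (rank 3).
children = {"r": ("x", "y"), "x": ("a", "b"), "y": ("c", "d")}
rank = {"r": 1, "x": 2, "y": 3, "a": 4, "b": 4, "c": 4, "d": 4}
slices = time_slices(children, rank)
# slices == {1: ['r'], 2: ['x', 'y@x'], 3: ['y', 'a@y', 'b@y']}
```

In this example, adding the slice made of the four leaves gives $n=4$ time slices, one per value of the ranking.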

Gene Families Evolutionary Histories.

The objects we study in this work model the evolution of a gene family within a species tree. A species tree, which will be denoted by $\mathbf{S}$ from now on, is a uniquely labeled rooted binary tree that represents the evolution of a set of species through speciation events; $\mathbf{S}$ can be either unranked or ranked. A gene family evolves within $\mathbf{S}$ from a single ancestral gene, present in the species $r(\mathbf{S})$ , through four possible kinds of evolutionary events:

• Speciation $\mathbb{S}$: a gene $x$ present in species $u$ has two descendant genes $x_{\ell}$ present in species $u_{\ell}$ and $x_{r}$ present in species $u_{r}$.

• Duplication $\mathbb{D}$: a gene $x$ present in species $u$ is duplicated, with a new copy $x_{d}$ of $x$ appearing in species $u$; $x$ is said to be the original gene while $x_{d}$ is the novel gene.

• Loss $\mathbb{L}$: a gene $x$ present in species $u$ has exactly one descendant either in $x_{\ell}$ or in $x_{r}$, implying that after a speciation at species $u$, exactly one of the two resulting genes is lost along the branch toward either $u_{\ell}$ or $u_{r}$.

• Horizontal Gene Transfer $\mathbb{T}$ (HGT): this is similar to a duplication but the novel copy, denoted $x_{t}$ here, appears in a species $v$ different from $u$ and incomparable with $u$, called the receiver of the HGT, while $u$ is called the donor of the HGT. If $\mathbf{S}$ is ranked, with ranking $\pi$, the receiver species $v$ is required to exist at the same time as $u$, i.e. to satisfy two ranking constraints, $\pi(p(v))<\pi(u)<\pi(v)$.

Definition 1

An evolutionary history for a gene family within a species tree $\mathbf{S}$ is a unary-binary ordered rooted tree $\mathbf{T}$ together with two mappings $s:\ V(\mathbf{T})\rightarrow V(\mathbf{S})$ and $e:\ V(\mathbf{T})\rightarrow\{\mathbb{S},\mathbb{D},\mathbb{L},\mathbb{T},Extant\}$ satisfying the following constraints:

• if $x$ is a leaf, $e(x)\in\{Extant,\mathbb{L}\}$;

• if $x$ is internal and binary, $e(x)\in\{\mathbb{S},\mathbb{D},\mathbb{T}\}$;

• if $x$ is internal and unary then $e(x)=\mathbb{S}$ (note that technically the event associated to a unary node in the species tree is not speciation in the biological meaning, but we chose to label it as such for expository reasons);

• if $e(x)=\mathbb{S}$ and $s(x)=u$ is binary then $s(x_{\ell})=u_{\ell}$ and $s(x_{r})=u_{r}$;

• if $e(x)=\mathbb{S}$ and $s(x)=u$ is unary then $s(x_{c})=u_{c}$;

• if $e(x)=\mathbb{D}$ then $s(x_{\ell})=s(x_{r})=s(x)$;

• if $e(x)=\mathbb{T}$ then $s(x_{\ell})=s(x)$ and $s(x_{r})\in\overline{C}(s(x))$.

The size of a history is the number of leaves $x$ such that $e(x)=Extant$ .

Intuitively, this definition states that a history is represented by a tree where each node corresponds to a gene present in a species, either extant or ancestral (the mapping $s$ ), and each ancestral gene either was lost ( $e(x)=\mathbb{L}$ ) or evolved toward extant genes through a duplication ( $e(x)=\mathbb{D}$ ), an HGT to an incomparable receiver species ( $e(x)=\mathbb{T}$ ) or a speciation ( $e(x)=\mathbb{S}$ ), while extant genes belong to extant species; the constraints on the species mapping $s$ ensure that this history can be embedded within $\mathbf{S}$ as illustrated in Figure 1.
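To make the constraints of Definition 1 concrete, here is a minimal Python sketch of a checker for an unranked species tree (so no unary nodes occur). The tuple encoding of histories, `(event, species)` for leaves and `(event, species, left, right)` for binary nodes, and the function names are hypothetical choices for this illustration. The sketch checks only the listed constraints, plus the informal requirement that extant genes belong to extant (leaf) species; it does not enforce the conventions of Remark 2 or the placement of the ancestral gene at $r(\mathbf{S})$.

```python
def comparable(u, v, parent):
    """True if u == v or one species is an ancestor of the other."""
    def on_root_path(a, b):  # is b on the path from a up to the root?
        while a is not None:
            if a == b:
                return True
            a = parent.get(a)
        return False
    return on_root_path(u, v) or on_root_path(v, u)


def is_history(x, children, parent):
    """Check the constraints of Definition 1 on the history rooted at x."""
    event, u = x[0], x[1]
    if len(x) == 2:  # leaf of the history
        # Assumption: an extant gene must sit in a leaf species.
        return event == "L" or (event == "Extant" and u not in children)
    left, right = x[2], x[3]
    if event == "S":    # children follow the two child species, in order
        ok = u in children and (left[1], right[1]) == children[u]
    elif event == "D":  # both copies stay in species u
        ok = left[1] == u and right[1] == u
    elif event == "T":  # receiver must be incomparable with the donor
        ok = left[1] == u and not comparable(u, right[1], parent)
    else:
        ok = False
    return (ok and is_history(left, children, parent)
            and is_history(right, children, parent))


def size(x):
    """Size of a history: number of leaves labeled Extant."""
    if len(x) == 2:
        return 1 if x[0] == "Extant" else 0
    return size(x[2]) + size(x[3])


# Species tree with root r and leaf children a, b.
children = {"r": ("a", "b")}
parent = {"a": "r", "b": "r"}
keep_a = ("S", "r", ("Extant", "a"), ("L", "b"))
keep_b = ("S", "r", ("L", "a"), ("Extant", "b"))
dup = ("D", "r", keep_a, keep_b)          # duplication at the root species
hgt = ("T", "a", ("Extant", "a"), ("Extant", "b"))
bad = ("T", "r", keep_a, ("Extant", "a"))  # receiver a is comparable with r
```

Here `dup` and `hgt` satisfy the checked constraints and have size 2, while `bad` is rejected because the receiver species is a descendant of the donor.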

By convention, for duplications, we consider that the novel copy of a gene $x$ is its right child $x_{r}$, with $x_{\ell}$ representing the original copy. Histories considered by the $\mathbb{D}\mathbb{L}$-model (resp. $\mathbb{D}\mathbb{L}\mathbb{T}$-model), which allows duplications and losses (resp. duplications, losses and HGTs), are called $\mathbb{D}\mathbb{L}$-histories (resp. $\mathbb{D}\mathbb{L}\mathbb{T}$-histories).

Remark 1

By modeling the evolution of a gene family with ordered trees we differ from the classical notion of reconciliation, that also models the evolution of a gene family but considers that when a gene duplication occurs, the original gene and the novel gene are indistinguishable. As a result, the children of a duplication are ordered within a history, whereas they are not in a reconciliation.

Remark 2

Gene losses are modeled as speciation events with one disappearing gene. As a consequence, we cannot have a duplication or an HGT that results in one of the two resulting gene copies being lost. This is necessary to avoid creating an infinite number of histories of a given size, due to an arbitrary number of duplications within a species, each followed by a loss, or an arbitrarily long sequence of HGTs, again each followed by a loss, leading to at most one extant gene.

Time Consistency of $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories.

Given an unranked species tree $\mathbf{S}$ , a $\mathbb{D}\mathbb{L}\mathbb{T}$ -history as defined above is time inconsistent if there exists a gene $x$ belonging to a species $u$ such that one of its ancestors belongs to a species $v$ and one of its descendants belongs to a species $v^{\prime}$ ancestral to $v$ . This pattern can be observed due to the fact that, in the definition of a $\mathbb{D}\mathbb{L}\mathbb{T}$ -history, the choice of the receiver species $v$ of an HGT of gene $x$ belonging to species $u$ is not restricted to the set of species that are also incomparable with all species containing genes that are ancestral to $x$ ; see Figure 3 for an illustration.

The problem of computing gene family evolutionary scenarios that are both parsimonious and time-consistent has been shown to be intractable when such scenarios are modeled by reconciliations with an unranked species tree tcbb/TofighHL11 ; cmb/OvadiaFCL11 , while, when the provided species tree $\mathbf{S}$ is ranked, the problem becomes tractable (see bib/DoyonRDB11 and references therein). Similarly, when $\mathbf{S}$ is ranked, we can ensure time-consistency of evolutionary histories, by requiring that the donor and receiver of any HGT belong to the same time slice in $\mathbf{S}_{\pi}$ , i.e. the receiver of an HGT of a gene belonging to a species $u$ belongs to $\overline{C}(u)=t(u)-\{u\}$ .

3 Methods

Our results (counting and sampling algorithms) are based on the design of formal grammars specifying, for a given species tree $\mathbf{S}$, the combinatorial families of $\mathbb{D}\mathbb{L}$-histories and $\mathbb{D}\mathbb{L}\mathbb{T}$-histories constrained by $\mathbf{S}$. These grammars are then used as templates to design dynamic programming algorithms for counting the histories of a fixed size and for sampling them under the uniform distribution. Moreover, these grammars are amenable to techniques of analytic combinatorics that allow us to compute the asymptotic growth constant for the number of histories. We first describe our grammars, then the counting and sampling algorithms, and finally the asymptotic analysis of these grammars.

3.1 General grammars specifying $\mathbb{D}\mathbb{L}$ -histories and $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories

In this section we describe grammars specifying histories evolving within a species tree using the formalism developed in comb/FlajoletS09 . We describe grammars for $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories, for both an unranked and a ranked species tree; these grammars can then be specialized into grammars for $\mathbb{D}\mathbb{L}$ -histories by omitting the rules related to HGT.

Let $\mathbf{S}$ be a species tree. If $\mathbf{S}$ is unranked, it is a binary tree, otherwise, if it comes with a ranking $\pi$ , we consider the unary-binary species tree $\mathbf{S}_{\pi}$ . So in the statements below, when mentioning a ranked species tree we mean the unary-binary tree $\mathbf{S}_{\pi}$ defined by the ranking.

We denote by ${H}_{u}$ the set of $\mathbb{D}\mathbb{L}\mathbb{T}$-histories for the tree $\mathbf{S}_{u}$. In the most general setting, following comb/FlajoletS09 , these grammars contain both terminal symbols, corresponding to atomic elements of the histories (nodes), and non-terminal symbols, corresponding to combinatorial operators applied to sets of histories. We use the terminal $\mathcal{Z}_{u}$ to encode a gene present in extant species $u$; moreover, we use $\mathcal{X}_{u}$ for a gene lost at species $u$, $\mathcal{Y}_{u}$ for a duplication at species $u$ and $\mathcal{W}_{u}$ for an HGT with donor species $u$. We consider two combinatorial operators, $\cup$ the disjoint union and $\times$ the Cartesian product.

Theorem 3.1

The set ${H}_{r(\mathbf{S})}$ defined by the grammar below specifies the set of all $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories for a species tree $\mathbf{S}$ .

[TABLE]

where $\overline{C}(u)$ is the set of nodes that are incomparable with $u$ in $\mathbf{S}$ . The set of $\mathbb{D}\mathbb{L}$ -histories is specified by the same grammar where rule (6) is removed and the terms ${T}_{u}$ are removed from rules (1) and (2).

Proof

The grammar follows the definition of histories, Definition 1. Rule (1) simply states that the root (i.e. the first evolutionary event of the history) of a $\mathbb{D}\mathbb{L}\mathbb{T}$-history within the subtree $\mathbf{S}_{u}$, assuming it is not reduced to a leaf, is either a speciation, a duplication or a transfer of the ancestral gene present in species $u$: the non-terminals ${S}_{u}$, ${D}_{u}$ and ${T}_{u}$ represent respectively these three subsets of ${H}_{u}$. Rule (2) addresses the case where $\mathbf{S}_{u}$ is composed of a single leaf, in which case there cannot be a speciation event; instead, the history may be reduced to a single gene in species $u$.

Rule (3) describes a speciation event at species $u$ . The ancestral gene can either evolve into a gene in each of the two children of $u$ (first term of the union) or into a gene in a single child of $u$ due to a gene loss in the other child of $u$ . In the case where $u$ is unary (due to being a node created by the time slicing in a ranked $\mathbf{S}$ ), the ancestral gene evolves into a copy in the unique child $u_{c}$ of $u$ .

Rule (5) addresses the case of a duplication. It results in two ordered independent histories starting at species $u$ : the first one being the history of the original copy of the starting ancestral gene and the second one the history rooted at the novel gene created by the duplication.

Last, Rule (6) addresses the case of histories starting by a HGT. Generally, a HGT has a structure similar to a duplication but for the fact that the novel gene appears in a species that is incomparable with $u$ .

These various rules cover all cases for describing the possible first event of a history and are mutually exclusive, thus providing a complete recursive specification of $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories for a given species tree $\mathbf{S}$ . It follows immediately that removing the rule and non-terminals associated to HGT gives a grammar specifying $\mathbb{D}\mathbb{L}$ -histories for $\mathbf{S}$ . ∎

Remark 3

The above grammar can be greatly simplified if one is interested only in the number of histories of a given size, as opposed to the specific species where gene duplication, gene loss and HGT events occur and the precise gene content of extant species. In this case, one simply identifies all non-terminals $\mathcal{Z}_{u}$ (resp. $\mathcal{X}_{u}$ , $\mathcal{Y}_{u}$ , $\mathcal{W}_{u}$ ) with a single variable $\mathcal{Z}$ (resp. $\mathcal{X}$ , $\mathcal{Y}$ , $\mathcal{W}$ ). From now on, we follow this approach.

3.2 Counting and sampling algorithms

The grammar defined above can naturally be turned into a dynamic programming algorithm computing the number of histories of a given size. This algorithm computes tables $H,D,S,T$ where, for a given node $u$ of $\mathbf{S}$ and a given history size $n$ , $H[u,n]$ (respectively, $D[u,n]$ , $S[u,n]$ , $T[u,n]$ ) is the number of $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories of size $n$ evolving within $\mathbf{S}_{u}$ (respectively, starting with a duplication, a speciation, and an HGT). We illustrate this in the case of $\mathbb{D}\mathbb{L}\mathbb{T}$ -histories with an unranked species tree $\mathbf{S}$ .

[TABLE]
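As a minimal sketch of such a dynamic programme, the Python code below counts histories in the simpler $\mathbb{D}\mathbb{L}$ -model for an unranked species tree. Several points are assumptions of this sketch rather than statements of the paper: the size of a history is taken to be its number of extant genes, losses and duplications carry no weight, a species tree is encoded as `()` for a leaf or a pair `(left, right)`, and the function name `count_dl_histories` is hypothetical.

```python
def count_dl_histories(tree, n_max):
    """Return a list h with h[n] = number of DL-histories with n genes.

    `tree` is () for a leaf, or a pair (left, right) of subtrees.
    """
    if tree == ():
        # Non-duplication term at a leaf: a single extant gene.
        base = [0] * (n_max + 1)
        if n_max >= 1:
            base[1] = 1
    else:
        hv = count_dl_histories(tree[0], n_max)
        hw = count_dl_histories(tree[1], n_max)
        # Speciation: a gene in both children, or in one child with a loss
        # in the other child.
        base = [
            hv[n] + hw[n] + sum(hv[m] * hw[n - m] for m in range(1, n))
            for n in range(n_max + 1)
        ]
    # H = base + H * H (duplication), solved by increasing size n.
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        h[n] = base[n] + sum(h[m] * h[n - m] for m in range(1, n))
    return h


leaf = ()
cherry = (leaf, leaf)
print(count_dl_histories(leaf, 5)[1:])    # [1, 1, 2, 5, 14]
print(count_dl_histories(cherry, 4)[1:])  # [2, 7, 34, 200]
```

For a single leaf the counts are the Catalan numbers, since only duplications can occur. The quadratic convolution at each node gives $\mathcal{O}(k\,n^{2})$ arithmetic operations for a species tree with $k$ nodes in this simplified setting.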

A random generation algorithm can then be adapted from the counting recurrences, resulting in an instance of the so-called recursive method wilf1977unified . The right-hand sides of the counting equations are split into sums of multiplicative terms. Starting from the initial state $H[r(\mathbf{S}),n]$ , the algorithm randomly chooses a term from the right-hand side of the current state, with probability proportional to its contribution to the count. When the selected term is a product of two terms, the size $n$ needs to be distributed across the two factors, and a pair of sizes $(m,n-m)$ is chosen with probability proportional to the associated count. For the sake of performance, the various alternatives can be explored in Boustrophedon order, ensuring an overall $\mathcal{O}(n\log(n))$ worst-case complexity flajolet1994calculus . Recursive calls are then performed over the states associated with the chosen term, until a leaf is chosen (term $\mathbbm{1}$ ). This leads to the following result.
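The recursive method can be sketched as follows for the $\mathbb{D}\mathbb{L}$ -model. This is an illustration under our own assumptions, not the authors' implementation: size is the number of extant genes, species trees are nested tuples (`()` for a leaf), histories are returned as nested tuples tagged `"dup"` or `"spec"` with leaves `"gene"` and `"loss"`, and the names `build_tables`, `pick` and `sample` are hypothetical. Alternatives are scanned sequentially rather than in Boustrophedon order, so the complexity bound quoted above does not apply to this sketch.

```python
import random


def build_tables(tree, n_max, tables=None):
    """Map each subtree to (h, s): all histories, and non-duplication ones."""
    if tables is None:
        tables = {}
    if tree in tables:
        return tables
    if tree == ():
        s = [0] * (n_max + 1)
        if n_max >= 1:
            s[1] = 1
    else:
        v, w = tree
        build_tables(v, n_max, tables)
        build_tables(w, n_max, tables)
        hv, hw = tables[v][0], tables[w][0]
        s = [hv[n] + hw[n] + sum(hv[m] * hw[n - m] for m in range(1, n))
             for n in range(n_max + 1)]
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        h[n] = s[n] + sum(h[m] * h[n - m] for m in range(1, n))
    tables[tree] = (h, s)
    return tables


def pick(weights, rng):
    """Return index i with probability weights[i] / sum(weights), exactly."""
    r = rng.randrange(sum(weights))
    for i, w in enumerate(weights):
        if r < w:
            return i
        r -= w


def sample(tree, n, tables, rng):
    """Draw a uniform random DL-history with n genes for `tree`."""
    h, s = tables[tree]
    if rng.randrange(h[n]) >= s[n]:
        # Duplication: split n between the two copies.
        m = 1 + pick([h[m] * h[n - m] for m in range(1, n)], rng)
        return ("dup", sample(tree, m, tables, rng),
                sample(tree, n - m, tables, rng))
    if tree == ():
        return "gene"  # only reachable with n == 1
    v, w = tree
    hv, hw = tables[v][0], tables[w][0]
    both = [hv[m] * hw[n - m] for m in range(1, n)]
    case = pick([sum(both), hv[n], hw[n]], rng)
    if case == 0:
        m = 1 + pick(both, rng)
        return ("spec", sample(v, m, tables, rng),
                sample(w, n - m, tables, rng))
    if case == 1:
        return ("spec", sample(v, n, tables, rng), "loss")
    return ("spec", "loss", sample(w, n, tables, rng))


rng = random.Random(2019)
species = (((), ()), ())
tables = build_tables(species, 5)
print(sample(species, 5, tables, rng))
```

The tables are computed once and shared by all draws, which mirrors the split between the preprocessing cost and the per-sample cost in the result below.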

Theorem 3.2

The number of histories of size $n$ constrained by a species tree of size $k$ can be computed in polynomial time $\mathcal{O}(\Phi(n,k))$ and space $\mathcal{O}(\Psi(n,k))$ , where $\Phi(n,k)$ and $\Psi(n,k)$ both depend on the model ( $\mathbb{D}\mathbb{L}$ or $\mathbb{D}\mathbb{L}\mathbb{T}$ ) and the ranked/unranked nature of the species tree, as summarized in Table 1.

The uniform random generation of $h$ histories of size $n$ can be performed in time $\mathcal{O}(\Phi(n,k)+h\cdot\Upsilon(n,k))$ .

3.3 Asymptotic number of histories in the $\mathbb{D}\mathbb{L}$ -model

The grammar given in Theorem 3.1 defines a combinatorial specification of the set of histories for a given species tree in a given evolutionary model. In this section, we derive the asymptotic number of histories in the $\mathbb{D}\mathbb{L}$ -model and use it later on two specific species trees: the caterpillar and complete binary trees. The following theorem is the main result of this section and describes their asymptotic growth for $n$ tending to infinity.

Theorem 3.3

For any given species tree $\mathbf{S}$ , the number of histories in the unranked $\mathbb{D}\mathbb{L}$ -model given by Equations (1)-(5) is, for large $n$ , equal to

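$$\gamma_{\mathbf{S}}\,\frac{\rho_{\mathbf{S}}^{-n}}{n^{3/2}}\left(1+\mathcal{O}\left(\frac{1}{n}\right)\right) \tag{12}$$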

for explicitly computable constants $\gamma_{\mathbf{S}}>0$ and $\rho_{\mathbf{S}}\in(0,1/4]$ .

In the remainder of this section we prove this theorem. The grammars are amenable to enumerative and analytic combinatorics techniques. We follow the general approach presented in Flajolet and Sedgewick comb/FlajoletS09 and Drmota comb/Drmota97 . It consists mainly in translating the combinatorial specification of a combinatorial family into equations defining its counting generating function. Then, its analytic properties lead to precise asymptotic formulas for its coefficients. We provide an overview of this approach in Example 1.

Example 1

Consider the class of rooted binary trees ${B}$ . Such a tree is either a leaf, or it consists of a root with two children which are also each roots of binary trees. Let us mark each leaf with the variable $\mathcal{Z}$ . Then, the grammar is given by

$${B}=\mathcal{Z}\;\cup\;\left({B}\times{B}\right).$$

Let $b_{n}$ be the number of binary trees with $n$ leaves and let $B(z)=\sum_{n\geq 1}b_{n}z^{n}$ be the counting generating function of binary trees. The symbolic method (comb/FlajoletS09, Part A) translates this grammar directly into an equation for the generating function:

$$B(z)=z+B(z)^{2}. \tag{13}$$

Its generating function is thus given by $B(z)=\frac{1-\sqrt{1-4z}}{2}.$

The general method of singularity analysis from analytic combinatorics (comb/FlajoletS09, Chapter VI) allows us to directly get the asymptotics of the coefficients. First, by the Cauchy–Hadamard theorem, the asymptotic growth is directly connected with the dominant singularities (and the radius of convergence) of the counting generating function. Here, the generating function $B(z)$ becomes singular at $z=1/4$ , which is also the unique singular point. Hence, the coefficients $b_{n}$ grow like $4^{n}$ . Second, using transfer theorems of analytic combinatorics (comb/FlajoletS09, Theorem VI.1 and Theorem VI.3) we also get the subexponential terms and recover the well-known result for Catalan numbers $b_{n+1}=\frac{1}{n+1}\binom{2n}{n}$ (see OEIS A000108 comb/Sloane ):

$$b_{n}=\frac{4^{n-1}}{\sqrt{\pi n^{3}}}\left(1+\mathcal{O}\left(\frac{1}{n}\right)\right)$$

for $n\to\infty$ . $\blacksquare$
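The estimate of Example 1 can be checked numerically. The short Python sketch below (function names are ours) computes $b_{n}$ from the recurrence implied by $B(z)=z+B(z)^{2}$ and compares it with the standard Catalan estimate $4^{n-1}/\sqrt{\pi n^{3}}$ .

```python
import math


def binary_tree_counts(n_max):
    """b[n] = number of rooted ordered binary trees with n leaves (n_max >= 1)."""
    b = [0] * (n_max + 1)
    b[1] = 1
    for n in range(2, n_max + 1):
        # Coefficient extraction in B(z) = z + B(z)^2.
        b[n] = sum(b[m] * b[n - m] for m in range(1, n))
    return b


def catalan_estimate(n):
    """Leading term 4^(n-1) / sqrt(pi * n^3) of b_n."""
    return 4 ** (n - 1) / math.sqrt(math.pi * n ** 3)


b = binary_tree_counts(200)
for n in (10, 50, 200):
    print(n, b[n] / catalan_estimate(n))
```

The printed ratios approach $1$ as $n$ grows, with a relative error of order $1/n$ .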

We will now describe this approach applied to the grammar specifying the $\mathbb{D}\mathbb{L}$ -histories with an unranked species tree $\mathbf{S}$ . Let $h_{u,n}$ be the number of $\mathbb{D}\mathbb{L}$ -histories of $\mathbf{S}_{u}$ consisting of $n$ genes represented in the generating function by the formal variable $z$ . We define the counting generating functions

$$H_{u}(z)=\sum_{n\geq 1}h_{u,n}z^{n}.$$

The coefficients $h_{u,n}$ represent the number of histories of size $n$ associated with the species tree $\mathbf{S}_{u}$ , independent of the number of losses or duplications. These generating functions (one per species $u$ of $\mathbf{S}$ ) are strongly related to the generating function of binary trees $B(z)$ introduced in Example 1.

Lemma 1

For a given species tree $\mathbf{S}$ the counting generating function $H_{r(\mathbf{S})}(z)$ for histories in the unranked $\mathbb{D}\mathbb{L}$ -model is defined by the system of functional equations

[TABLE]

over all nodes $u$ of $\mathbf{S}$ , where

[TABLE]

Proof

The symbolic method (comb/FlajoletS09, Part A) translates the unranked $\mathbb{D}\mathbb{L}$ -grammar of Equations (1)-(5) directly into a system of equations for the generating functions. We get

[TABLE]

Comparing these equations with the one for binary trees from Equation (13) the claim follows. ∎
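As a worked illustration of this translation, the following display spells out the equations under assumptions that we state explicitly: $z$ marks extant genes only, losses and duplications are unmarked, and the speciation rule contributes one term for a gene in both children and one term for each child when the other child loses the gene. The last line treats the species tree with two leaves (the cherry), whose first coefficients can be checked by hand.

```latex
\begin{align*}
H_u(z) &= z + H_u(z)^2
  && \text{($u$ a leaf)},\\
H_u(z) &= S_u(z) + H_u(z)^2,\quad
  S_u(z) = H_v(z)H_w(z) + H_v(z) + H_w(z)
  && \text{($u$ internal with children $v,w$)}.
\end{align*}
% Comparing with B(z) = z + B(z)^2 from Equation (13):
\[
H_u(z) = B(z) \ \text{(leaf)}, \qquad
H_u(z) = B\bigl(S_u(z)\bigr) \ \text{(internal)}.
\]
% Species tree with two leaves (cherry):
\[
H(z) = B\bigl(B(z)^2 + 2B(z)\bigr)
     = 2z + 7z^2 + 34z^3 + 200z^4 + \cdots
\]
```

For instance, the coefficient $2$ of $z$ corresponds to the two histories in which the single gene survives in one leaf and is lost in the other.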

The advantage of a generating function approach is that we are able to identify the subexponential growth as $n^{-3/2}$ , and that we are able to explicitly compute the exponential growth rate $\rho_{\mathbf{S}}^{-1}$ and the constant $\gamma_{\mathbf{S}}$ for a fixed species tree $\mathbf{S}$ . We will compute the involved constants explicitly for the caterpillar tree in Section 4.1.1 and for the complete binary tree in Section 4.1.2.

By basic principles of analytic combinatorics, the asymptotic growth of a counting sequence is directly related to the radius of convergence of the corresponding generating function. In particular, its dominant singularity (i.e. the one closest to the origin) defines its asymptotic growth. By the construction in terms of nested radicals, the generating function $H_{u}(z)$ is singular if and only if at least one of its radicals becomes zero. Therefore, we make the structure of nested radicals visible. Writing the explicit form of the outermost $B(z)$ in (14) gives

[TABLE]

Then, the radicands satisfy the following recurrence

[TABLE]

The recurrence can be used to determine the nature of the radii of convergence. For a node $u$ we define $\rho_{u}$ as the radius of convergence of $H_{u}(z)$ .

Lemma 2

Let $u$ be the parent of $v$ in $\mathbf{S}$ . Then, $\rho_{u}<\rho_{v}$ and $\rho_{u}\in(0,1/4]$ with $\rho_{u}=1/4$ if $u$ is a leaf. Furthermore, $R_{u}(z)$ is the only radicand that vanishes at $z=\rho_{u}$ and $\rho_{u}$ is a simple root.

Proof

By combinatorial construction $H_{u}(z)$ is built of nested radicals and does not include any poles. Therefore, its dominant singularity must be at a point where (at least) one of its radicands vanishes.

We continue by induction on the depth of the subtree with root $u$ given by $\mathbf{S}_{u}$ . The depth is the longest path from the root to any leaf. As a first step, we prove that $R_{u}(0)=1$ and that $\rho_{u}\leq 1/4$ . For a leaf $u$ it is clear from Relation (17) that $R_{u}(0)=1$ and that $\rho_{u}=1/4$ .

Next, let $v$ and $w$ be the children of $u$ such that $\rho_{v}\leq\rho_{w}$ . By the induction hypothesis we directly get

[TABLE]

In order to continue, note that $R_{u}(z)$ is monotonically decreasing on $[0,+\infty]$ , because from the decomposition in (16) and (15) we see that

[TABLE]

for certain non-negative numbers $a_{n}$ .

By the induction hypothesis and Relation (17), $R_{u}(z)$ is a continuous function on $(0,\rho_{v})$ . Hence, we get

[TABLE]

Thus, on the one hand, by the intermediate value theorem $R_{u}(z)$ must have at least one zero in the interval $(0,\rho_{v})$ . On the other hand, as $R_{u}(z)$ is monotonically decreasing it has at most one zero in $(0,\rho_{v})$ . Hence, this zero is equal to $\rho_{u}$ .

Finally, the above reasoning implies that among the nested radicals of $H_{u}(z)$ the outermost one is the first one that vanishes, and no other radical vanishes at the same time. Thus, $\rho_{u}$ is the radius of convergence of $H_{u}(z)$ . Moreover, by (18) we see that the derivative $R_{u}^{\prime}(z)$ has non-positive coefficients. Hence, $\rho_{u}$ is a simple root. ∎
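The proof suggests a direct numerical procedure: evaluate the radicand of a node from those of its children and locate its unique zero below the radii of the children by bisection. The Python sketch below does this under an assumption of ours about the form of the recurrence, namely $R_{u}(z)=1-4z$ at a leaf and $R_{u}=3\sqrt{R_{v}}+3\sqrt{R_{w}}-\sqrt{R_{v}R_{w}}-4$ at an internal node with children $v,w$ (obtained by substituting $H_{v}=(1-\sqrt{R_{v}})/2$ into $R_{u}=1-4(H_{v}H_{w}+H_{v}+H_{w})$ ). Trees are nested tuples with `()` for a leaf, and the function names are hypothetical.

```python
import math


def radicand(tree, z):
    """Outermost radicand R_u(z) of H_u for the subtree `tree`."""
    if tree == ():
        return 1 - 4 * z
    # Clamp at 0 to absorb rounding errors very close to a child's radius.
    rv = math.sqrt(max(0.0, radicand(tree[0], z)))
    rw = math.sqrt(max(0.0, radicand(tree[1], z)))
    return 3 * rv + 3 * rw - rv * rw - 4


def radius(tree, tol=1e-14):
    """Radius of convergence rho_u: the zero of R_u in (0, min child radius)."""
    if tree == ():
        return 0.25
    lo, hi = 0.0, min(radius(tree[0]), radius(tree[1]))
    # R_u is positive at 0, decreasing, and negative at hi: bisect.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if radicand(tree, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


cherry = ((), ())
print(radius(cherry), (6 * math.sqrt(5) - 13) / 4)
print(1 / radius((cherry, ())))  # growth rate for the caterpillar on 3 leaves
```

For the cherry the zero can be found in closed form, $\rho=(6\sqrt{5}-13)/4\approx 0.1041$ , which the bisection reproduces. The radii decrease from the leaves to the root, as stated in Lemma 2.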

Let us briefly digress and discuss, in a more general context, how to numerically compute the exponential growth of the coefficients of the fastest-growing generating function in a system of functional equations involving generating functions ${B}_{1},\dots,{B}_{k}$ of the form

[TABLE]

where the $\Phi_{i}$ are polynomials with non-negative integer coefficients in $k+1$ variables. Note that the grammar given in Theorem 3.1 is of this shape. In order to decide which of the $B_{i}$ ’s has this specific exponential growth, further information on the problem, such as that given in our case by Lemma 2, is needed. By Banach’s fixed point theorem, these equations admit a unique solution vector $(B_{1},\ldots,B_{k})\in(\mathbb{C}[[z]])^{k}$ with respect to the formal topology (comb/FlajoletS09, Section A.5). Furthermore, each $B_{i}(z)$ has non-negative coefficients in its expansion around $z=0$ (which is already clear from the combinatorial nature of the problem). Then, the multivariate version of the implicit function theorem implies that each of them has a non-zero radius of convergence which we call $\rho_{i}$ . By Pringsheim’s Theorem (comb/FlajoletS09, Theorem IV.6), $\rho_{i}\in[0,+\infty]$ is a singularity of $B_{i}(z)$ . Moreover, as $B_{i}(z)$ is an ordinary generating function of an infinite combinatorial class, we must have $\rho_{i}\in[0,1]$ . Finally, in order to compute the radius of convergence, we find the minimal point $z\in[0,1]$ where the implicit function theorem fails. To be more precise, we numerically compute solutions $\rho\in[0,1]$ and $b_{1},\ldots,b_{k}\in[0,+\infty)$ of the following system

[TABLE]

where $\delta_{i,j}$ is the Kronecker symbol: $\delta_{i,i}=1$ , and $\delta_{i,j}=0$ for $i\neq j$ .

Remark 4

The unranked $\mathbb{D}\mathbb{L}$ -grammars lead to the following specific shape

[TABLE]

Hence, we get $\det\left(\delta_{i,j}-\frac{\partial}{\partial b_{j}}\Phi_{i}(\rho,b_{1},\ldots,b_{k})\right)=\prod_{i=1}^{k}(1-2b_{i}).$ We actually know by Lemma 2 that the outermost square-root vanishes, which gives $b_{k}=B_{k}(\rho)=1/2$ . Additionally, we can also directly deduce from this system that $\rho_{k}\leq\rho_{k-1}$ .
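To make this concrete, here is a worked instance for the species tree with two leaves, written with $k=2$ unknowns ( $B_{1}$ for a leaf, $B_{2}$ for the root). The system is our own reading of the unranked $\mathbb{D}\mathbb{L}$ -rules, with $z$ marking genes only, and is meant as an illustration rather than a restatement of the display above.

```latex
\begin{align*}
B_1 &= \Phi_1(z,B_1,B_2) = z + B_1^2, \\
B_2 &= \Phi_2(z,B_1,B_2) = B_1^2 + 2B_1 + B_2^2 .
\end{align*}
% The Jacobian of (\Phi_1,\Phi_2) in (b_1,b_2) is lower triangular, hence
\[
\det\left(\delta_{i,j}-\tfrac{\partial}{\partial b_j}\Phi_i\right)
 = (1-2b_1)(1-2b_2).
\]
% Setting b_2 = 1/2 in the second equation:
\[
b_1^2 + 2b_1 = \tfrac14
\;\Longrightarrow\;
b_1 = \frac{\sqrt5-2}{2},
\qquad
\rho = b_1 - b_1^2 = \frac{6\sqrt5-13}{4} \approx 0.1041 .
\]
% Exponential growth rate:
\[
\rho^{-1} = \frac{52+24\sqrt5}{11} \approx 9.606 .
\]
```

The other factor of the determinant vanishes only at $b_{1}=1/2$ , that is at $z=1/4>\rho$ , so the dominant singularity indeed comes from the equation of the root.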

In the unranked $\mathbb{D}\mathbb{L}\mathbb{T}$ -model the system looks like

[TABLE]

where the last equation is the only one involving $B_{k}$ , as the root cannot be the receiver of an HGT. Note that the subsystem of the first $k-1$ equations is strongly connected but still does not satisfy the $a$ -properness condition (i.e. it is not a contraction in the formal topology) of the Drmota–Lalley–Woods Theorem (comb/FlajoletS09, Theorem VII.6), which would directly imply a square root singularity. Thus, we conjecture that the dominant singularity still comes solely from the outermost square root of $B_{k}$ , implying $b_{k}=1/2$ .

In the ranked $\mathbb{D}\mathbb{L}\mathbb{T}$ -model we are dealing with blocks of strongly connected components that correspond to the time slices. Note that the root is contained in a singleton time slice. Experiments suggest the same behavior as in the previous cases.

However, one thing is certain in all models: we always have $\rho_{r(\mathbf{S})}\leq\rho_{u}$ for all other subtrees with root $u$ of the species tree. Hence, there will always be a dominant minimal singularity in $[0,1]$ that can be (numerically) computed. Note however, that the determinant computation soon becomes extremely heavy.

After determining the radius of convergence, we must determine the number of singularities on the circle of convergence. As shown in the case of $\lambda$ -terms in (comb/BodiniGGG18, Lemma 8) there can only be one dominant singularity $\rho_{u}$ . Let us quickly repeat this argument here. Assume that there exists a root $z_{0}=\rho_{u}e^{i\theta}$ of the same modulus. Substituting this value into $R_{u}(z)$ from (18) gives

[TABLE]

which can only hold if $e^{in\theta}=1$ whenever $a_{n}\neq 0$ . Now, due to $a_{1}\neq 0$ we have $z_{0}=\rho_{u}$ . Hence, $\rho_{u}$ is the unique dominant real singularity of $H_{u}(z)$ .

Combining the previous results, we have shown for a family of constants $\gamma_{u,i}$ the following local singular expansion

[TABLE]

The fact that $R_{u}(z)$ has a simple root at $z=\rho_{u}$ shows that $\gamma_{u,0}>0$ . Then, by transfer theorems of analytic combinatorics (comb/FlajoletS09, Theorem VI.1 and Theorem VI.3), we get the claimed asymptotic expansion of Equation (12), where $\gamma_{\mathbf{S}}=\frac{\gamma_{u,0}}{2\sqrt{\pi}}>0$ for $u=r(\mathbf{S})$ , and this ends the proof of Theorem 3.3.
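The resulting estimate can be compared with exact counts on the smallest non-trivial example, the species tree with two leaves. The Python sketch below relies on assumptions of ours: genes are the only marked atoms, the radicand of the root is $R(z)=6s-s^{2}-4$ with $s=\sqrt{1-4z}$ , and the leading constant is obtained from $\gamma_{u,0}=\frac{1}{2}\sqrt{-\rho\,R^{\prime}(\rho)}$ , which follows from $H=(1-\sqrt{R})/2$ and a simple zero of $R$ at $\rho$ . The names `cherry_counts` and `estimate` are hypothetical.

```python
import math


def cherry_counts(n_max):
    """Exact numbers of DL-histories with n genes for the two-leaf tree."""
    b = [0] * (n_max + 1)          # histories inside a single leaf
    b[1] = 1
    for n in range(2, n_max + 1):
        b[n] = sum(b[m] * b[n - m] for m in range(1, n))
    # Speciation at the root: both children, or one child and one loss.
    s = [2 * b[n] + sum(b[m] * b[n - m] for m in range(1, n))
         for n in range(n_max + 1)]
    h = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        h[n] = s[n] + sum(h[m] * h[n - m] for m in range(1, n))
    return h


SQRT5 = math.sqrt(5)
RHO = (6 * SQRT5 - 13) / 4                 # zero of R(z) = 6s - s^2 - 4
S0 = 3 - SQRT5                             # s = sqrt(1 - 4 * RHO)
R_PRIME = -2 * (6 - 2 * S0) / S0           # dR/dz at RHO, using ds/dz = -2/s
GAMMA = math.sqrt(-RHO * R_PRIME) / (4 * math.sqrt(math.pi))


def estimate(n):
    """Leading term gamma * rho^(-n) * n^(-3/2)."""
    return GAMMA * RHO ** (-n) * n ** -1.5


h = cherry_counts(200)
for n in (5, 50, 200):
    print(n, h[n] / estimate(n))
```

The ratio of exact count to leading term tends to $1$ , with a deviation that shrinks roughly like $1/n$ , in line with the error term of the expansion.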

Remark 5

There are several possible extensions of the previous approach. First of all, it is straightforward to extend it to the ranked $\mathbb{D}\mathbb{L}$ -model. In that case one only needs to incorporate unary nodes arising from the time slices. Second, an extension to the $\mathbb{D}\mathbb{L}\mathbb{T}$ -model is also possible, yet the computations are more involved as the binary tree structure leading to Lemma 1 does not hold anymore. However, it can still be modeled with colored binary trees, where the number of colors depends on the size of the set of incomparable nodes (in the current time slice). Third, it is also possible to consider the distribution of certain parameters, such as the number of gene losses or the number of gene duplications; see e.g. comb/BonaF09 ; comb/GittenbergerJW18 ; comb/BanderierW17 for related results on lattice paths and trees. Using multivariate generating functions and marking each such event by an additional variable, as in the general grammar of Theorem 3.1, the above results for the $\mathbb{D}\mathbb{L}$ -model directly generalize to the respective ones on multivariate generating functions. All these generalizations are interesting future research directions.

The counting and sampling algorithms described above have been implemented in Python, and are available at https://github.com/cchauve/DLTcount.

4 Results

Over the next two sections, we will apply Theorem 3.3 to the special cases of the caterpillar and complete species tree in the unranked $\mathbb{D}\mathbb{L}$ -model, and explicitly determine the constants involved in the asymptotic expansion. Then, we apply our dynamic programming counting and sampling algorithms to study properties of random evolutionary histories.

4.1 Asymptotic expansion for extremal species trees in the $\mathbb{D}\mathbb{L}$ -model

Our experimental results (Section 4.2) suggest that for a given $k$ , the species trees having the largest (resp. smallest) number of $\mathbb{D}\mathbb{L}$ -histories are respectively the caterpillar tree and the balanced binary tree (Conjecture 1), defined below. In the present section, our main results are the explicit computation of the asymptotic growth and the leading constant of Theorem 3.3 for the caterpillar species tree (Propositions 1 and 2) and for the complete binary species tree, the special case of balanced trees when $k$ is a power of $2$ (Propositions 3 and 4, see also Table 4.1).

The rooted caterpillar tree $\mathbf{CT}_{k}$ can be defined as follows: $\mathbf{CT}_{1}$ is the tree reduced to a single leaf, while $\mathbf{CT}_{k}$ ( $k>1$ ) is the tree formed by a left subtree equal to $\mathbf{CT}_{k-1}$ and a right subtree equal to $\mathbf{CT}_{1}$ . Observe that every subtree of a caterpillar tree is itself a caterpillar tree, see Figure 4.

The complete binary tree $\mathbf{CB}_{h}$ with $k=2^{h}$ leaves can be defined as follows: $\mathbf{CB}_{0}$ is the tree reduced to a single leaf, while $\mathbf{CB}_{h}$ ( $h\geq 1$ ) is the tree formed by a left and a right subtree both equal to $\mathbf{CB}_{h-1}$ . Observe again that every subtree is itself a complete binary tree, see Figure 4. The complete binary tree is a special case of the class of balanced trees, defined as trees where, for each node, the number of leaves in the left subtree differs from the number of leaves in the right subtree by at most one. Complete binary trees are the only balanced trees in which the number of leaves is a power of two.

We can observe that the number of $\mathbb{D}\mathbb{L}$ -histories grows much faster for the caterpillar tree than for the complete binary tree. This is actually unsurprising given that the number of $\mathbb{D}\mathbb{L}$ -histories can be linked to the size of the grammar, which itself depends on the structure of the species tree. More precisely, the size of the grammar depends on the number of unique subtrees of the considered species tree $S$ . Each such subtree may be identified by its root $u$ and corresponds to one set of rules (1)-(6), while subtrees having the same topology lead to isomorphic subgrammars with the same counting generating functions. The caterpillar (resp. complete binary) tree has the largest (resp. smallest) number of unique subtrees within the set of species trees of the same size (when $k$ is a power of $2$ for the complete binary tree), compare also Table 4.1.
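The difference in the number of unique subtrees is easy to see on small instances. The following Python sketch builds both families from the definitions above and counts distinct subtree shapes; the nested-tuple encoding ( `()` for a leaf) and the function names are our own choices, and shapes are compared up to exchanging left and right children.

```python
def caterpillar(k):
    """CT_k: left subtree CT_{k-1}, right subtree a single leaf."""
    tree = ()
    for _ in range(k - 1):
        tree = (tree, ())
    return tree


def complete_binary(h):
    """CB_h: both subtrees equal to CB_{h-1}; 2**h leaves."""
    tree = ()
    for _ in range(h):
        tree = (tree, tree)
    return tree


def distinct_subtrees(tree):
    """Set of subtree shapes of `tree`, ignoring the order of children."""
    shapes = set()

    def canon(t):
        if t == ():
            c = ()
        else:
            c = tuple(sorted((canon(t[0]), canon(t[1]))))
        shapes.add(c)
        return c

    canon(tree)
    return shapes


for h in range(1, 5):
    k = 2 ** h
    print(k, len(distinct_subtrees(caterpillar(k))),
          len(distinct_subtrees(complete_binary(h))))
```

With $k$ leaves, the caterpillar has $k$ distinct subtrees ( $\mathbf{CT}_{1},\dots,\mathbf{CT}_{k}$ ), whereas the complete binary tree has only $\log_{2}k+1$ .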

Bibliography

[1] Ö. Åkerborg, B. Sennblad, L. Arvestad, and J. Lagergren. Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proceedings of the National Academy of Sciences, 106(14):5714–5719, 2009.
[2] L. Arvestad, J. Lagergren, and B. Sennblad. The gene evolution model and computing its associated probabilities. Journal of the ACM, 56(2):7:1–7:44, 2009.
[3] Y.-b. Chan, V. Ranwez, and C. Scornavacca. Inferring incomplete lineage sorting, duplications, transfers and losses with reconciliations. Journal of Theoretical Biology, 432:1–13, 2017.
[4] C. Banderier and M. Wallner. Lattice paths with catastrophes. Discrete Mathematics & Theoretical Computer Science, 19(1), Sept. 2017. Full version of the extended abstract with the same title that appeared in the Proceedings of the conference on Random Generation of Combinatorial Structures, GASCom 2016.
[5] M. S. Bansal, E. J. Alm, and M. Kellis. Reconciliation revisited: Handling multiple optima when reconciling with duplication, transfer, and loss. Journal of Computational Biology, 20(10):738–754, 2013.
[6] M. S. Bansal, M. Kellis, M. Kordi, and S. Kundu. RANGER-DTL 2.0: rigorous reconstruction of gene-family evolution by duplication, transfer and loss. Bioinformatics, 34(18):3214–3216, 2018.
[7] M. Bendkowski, O. Bodini, and S. Dovgal. Polynomial tuning of multiparametric combinatorial samplers. In Proceedings of the Fifteenth Workshop on Analytic Algorithmics and Combinatorics, ANALCO 2018, New Orleans, LA, USA, January 8-9, 2018, pages 92–106. SIAM, 2018.
[8] O. Bodini, D. Gardy, B. Gittenberger, and Z. Gołębiewski. On the number of unary-binary tree-like structures with restrictions on the unary height. Annals of Combinatorics, 22(1):45–91, 2018.
