A Faster Construction of Greedy Consensus Trees

Paweł Gawrychowski; Gad M. Landau; Wing-Kin Sung; Oren Weimann

arXiv:1705.10548·cs.DS·July 5, 2017


TL;DR

This paper presents faster algorithms for constructing greedy and frequency difference consensus trees, reducing the running time from quadratic in $n$ or $k$ to $\tilde{O}(kn^{1.5})$ and $\tilde{O}(kn)$ respectively.

Contribution

The paper introduces a deterministic method for labeling clusters and a micro-macro decomposition technique for testing cluster compatibility, which yield faster constructions of the frequency difference and greedy consensus trees.

Findings

01

Greedy consensus tree algorithm improved from $O(kn^{2})$ to $\tilde{O}(kn^{1.5})$

02

Frequency difference consensus tree algorithm improved from $\tilde{O}(\min\{kn^{2},k^{2}n\})$ to $\tilde{O}(kn)$

03

Significant reduction in computational complexity for phylogenetic consensus methods

Abstract

A consensus tree is a phylogenetic tree that captures the similarity between a set of conflicting phylogenetic trees. The problem of computing a consensus tree is a major step in phylogenetic tree reconstruction. It also finds applications in predicting a species tree from a set of gene trees. This paper focuses on two of the most well-known and widely used consensus tree methods: the greedy consensus tree and the frequency difference consensus tree. Given $k$ conflicting trees each with $n$ leaves, the previous fastest algorithms for these problems were $O(kn^{2})$ for the greedy consensus tree [J. ACM 2016] and $\tilde{O}(\min\{kn^{2},k^{2}n\})$ for the frequency difference consensus tree [ACM TCBB 2016]. We improve these running times to $\tilde{O}(kn^{1.5})$ and $\tilde{O}(kn)$ respectively.

Equations

$(x+y)\log(x+y-1)-y\log y-x\log(x-1)=x\log\left(1+\frac{y}{x-1}\right)+y\log\left(1+\frac{x-1}{y}\right)$


Full text


Paweł Gawrychowski

University of Haifa, Israel

Gad M. Landau

University of Haifa, Israel

Wing-Kin Sung

National University of Singapore, Singapore

Oren Weimann

University of Haifa, Israel


1 Introduction

A phylogenetic tree describes the evolutionary relationships among a set of $n$ species called taxa. It is an unordered rooted tree whose leaves represent the taxa and whose inner nodes represent their common ancestors. Each leaf has a distinct label from $[n]$ . The inner nodes are unlabeled and have at least two children.

Numerous phylogenetic trees, reconstructed from data sources like fossils or DNA sequences, have been published in the literature since the early 1860s. However, the phylogenetic trees obtained from different data sources or using different reconstruction methods result in conflicts (similar though not identical phylogenetic trees over the same set $[n]$ of leaf labels). The conflicts between phylogenetic trees are usually measured by their difference in signatures: The signature of a phylogenetic tree $T$ is the set $\{\mathsf{L}(u):u\in T\}$ where $\mathsf{L}(u)$ denotes the set of labels of all leaves in the subtree rooted at node $u$ of $T$ (the set $\mathsf{L}(u)$ is sometimes called a cluster). To deal with the conflicts between $k$ phylogenetic trees in a systematic manner, the concept of a consensus tree was invented. Informally, the consensus tree is a single phylogenetic tree that summarizes the branching structure (signatures) of all the conflicting trees. Consensus trees have been widely used in two applications:

1. Constructing a phylogenetic tree: First, by sampling the dataset, we generate $k$ different datasets (for some constant $k$ that can be as large as $10{,}000$). Then, we reconstruct one phylogenetic tree for each dataset. Finally, we build the consensus tree of these $k$ trees.

2. Constructing a species tree: First, a phylogenetic tree (called a gene tree) is reconstructed for each individual gene. Then, the species tree is created by building the consensus tree of all $k$ gene trees.

Many different types of consensus trees have been proposed in the literature. For almost all of them, optimal or near-optimal $\tilde{\mathcal{O}}(kn)$ time constructions are known. These include the Adams consensus tree [1], strict consensus tree [27], loose consensus tree [4, 13], majority-rule consensus tree [17, 13], majority-rule (+) consensus tree [11], and asymmetric median consensus tree [20, 21] (constructing the asymmetric median consensus tree was proven to be NP-hard for $k>2$ [20] and solvable in $\tilde{\mathcal{O}}(n)$ time for $k=2$ [21]). Two of the most notable exceptions are the frequency difference consensus tree [10] and the greedy consensus tree [5, 9], whose running time remains quadratic in either $k$ or $n$. In particular, the former can be constructed in $\tilde{\mathcal{O}}(\min\{kn^{2},k^{2}n\})$ time [11] and the latter in $\mathcal{O}(kn^{2})$ time [13]. For more details about different consensus trees and their advantages and disadvantages see the survey in [5], Chapter 30 in [8], and Chapter 8.4 in [31].

In this paper we propose novel algorithms for the frequency difference consensus tree problem and the greedy consensus tree problem. First, we present an $\mathcal{O}(kn\log^{2}n)$ time deterministic labeling method. The labeling method counts the frequency (number of occurrences) of every cluster $S$ in the input trees. Based on this labeling method, we obtain an $\mathcal{O}(kn\log^{2}n)$ time construction of the frequency difference consensus tree. Then, for the greedy consensus tree, we present our main technical contribution: a method that uses micro-macro decomposition to verify if a cluster $S$ is compatible with a tree $T$ in $\mathcal{O}(n^{0.5}\log n)$ time and, if so, modify $T$ to include $S$ in $\mathcal{O}(n^{0.5}\log n)$ amortized time. Using this procedure, we obtain an $\mathcal{O}(kn^{1.5}\log n)$ time construction of the greedy consensus tree.

The frequency difference consensus tree.

The frequency $f(S)$ of a cluster $S$ (a set of labels of all leaves in some subtree) is the number of trees that contain $S$ . A cluster is said to be compatible with another cluster if they are either disjoint or one is included in the other. A frequent cluster is a cluster that occurs in more trees than any of the clusters that are incompatible with it. The frequency difference consensus tree is a tree whose signature is exactly all the frequent clusters.
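The definitions above can be illustrated with a direct, quadratic-time sketch. This is only a naive reading of the definition, not the algorithm of this paper; the function names and the representation of a signature as a set of `frozenset` clusters are assumptions made for the example.

```python
from collections import Counter


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return a.isdisjoint(b) or a <= b or b <= a


def frequent_clusters(signatures):
    """Return all frequent clusters.

    signatures: one set of frozensets per input tree (the tree's signature).
    A cluster is frequent if it occurs in strictly more trees than every
    cluster that is incompatible with it.
    """
    freq = Counter(c for sig in signatures for c in sig)
    clusters = list(freq)
    return {
        s
        for s in clusters
        if all(freq[s] > freq[t] for t in clusters if not compatible(s, t))
    }
```

The comparison of every pair of clusters makes this sketch quadratic in the number of distinct clusters, which is the cost the faster constructions avoid.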

The frequency difference consensus tree was initially proposed by Goloboff et al. [10], and its relationship with other consensus trees was studied in [7]. In particular, it can be seen as a refinement of the majority-rule consensus tree [17, 13]. Moreover, it is known to give less noisy branches than the greedy consensus tree defined below. Steel and Velasco [30] concluded that “the frequency difference method is worthy of more widespread usage and serious study”. A naive construction of the frequency difference consensus tree takes $\mathcal{O}(k^{2}n^{2})$ time. The free software TNT [10] implements a heuristic method to construct it more efficiently. However, its time complexity remains unknown.

Recently, Jansson et al. [11] presented an $\mathcal{O}(\min\{kn^{2},k^{2}n+kn\log^{2}n\})$ time construction (implemented in the FACT software package [12]). Their algorithm first computes the frequency $f(S)$ of every cluster $S$ with non-zero frequency. This is done in total $\mathcal{O}(\min\{kn^{2},k^{2}n\})$ time. They then show that given these computed frequencies, the frequency difference consensus tree can be computed in additional $\mathcal{O}(kn\log^{2}n)$ time. In Section 2 we show how to compute all frequencies in total $\mathcal{O}(kn\log^{2}n)$ time leading to the following theorem:

Theorem 1.

The frequency difference consensus tree of $k$ phylogenetic trees $T_{1},T_{2},\ldots,T_{k}$ on the same set of leaves $[n]$ can be computed in $\mathcal{O}(kn\log^{2}n)$ time.

To prove the above theorem, we first develop an $\mathcal{O}(kn\log^{2}n)$ time algorithm for assigning a number $\mathsf{id}(u)\in[kn]$ to every $u\in T_{i}$ such that $\mathsf{id}(u)=\mathsf{id}(u^{\prime})$ iff $\mathsf{L}(u)=\mathsf{L}(u^{\prime})$ . With these numbers in hand, we can then compute the frequencies of all clusters in $\mathcal{O}(kn)$ time using counting sort (since there are only $kn$ clusters with non-zero frequencies, and each was assigned an integer bounded by $kn$ ). Notice that this also generates a sorted list of all clusters with non-zero frequencies.
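Once identifiers are available, the counting step described above might look as follows. This is a minimal sketch: the input format (`ids_per_tree`, one list of node identifiers per tree) and the function name are assumptions for illustration, and identifiers are taken to be integers in `range(max_id + 1)`.

```python
def cluster_frequencies(ids_per_tree, max_id):
    """Count cluster frequencies and order identifiers by frequency.

    ids_per_tree[i] lists id(u) for every node u of tree T_i. Inner nodes
    have at least two children, so each cluster occurs at most once per tree.
    """
    count = [0] * (max_id + 1)
    for ids in ids_per_tree:
        for x in ids:
            count[x] += 1

    # Counting sort on the frequency key, which is at most k.
    k = len(ids_per_tree)
    buckets = [[] for _ in range(k + 1)]
    for x, c in enumerate(count):
        if c > 0:
            buckets[c].append(x)
    order = [x for c in range(k, 0, -1) for x in buckets[c]]
    return count, order
```

Both passes touch each identifier a constant number of times, so the running time is linear in the total number of nodes plus `max_id`.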

The greedy consensus tree.

We say that a given collection $\mathcal{C}$ of subsets of $[n]$ is consistent if there exists a phylogenetic tree $T$ such that the signature of $T$ is exactly $\mathcal{C}$ . The greedy consensus tree is defined by the following procedure: We begin with an initially empty $\mathcal{C}$ and then consider all clusters $S$ in decreasing order of their frequencies. In this order, for every $S$ , we check if $\mathcal{C}\cup\{S\}$ is consistent, and if so we add $S$ to $\mathcal{C}$ .
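A naive rendering of this greedy procedure is sketched below. It assumes that the trivial clusters (all singletons and $[n]$) appear in the input list, and it uses pairwise compatibility (disjoint or nested) as the consistency test; the function name and the `frozenset` representation are illustrative choices, not part of the paper.

```python
def greedy_consensus(clusters_by_frequency):
    """Select clusters greedily.

    clusters_by_frequency: frozensets sorted by decreasing frequency.
    A cluster is kept if it is disjoint from, or nested with, every
    cluster kept so far.
    """
    chosen = []
    for s in clusters_by_frequency:
        if all(s.isdisjoint(t) or s <= t or t <= s for t in chosen):
            chosen.append(s)
    return chosen
```

Each candidate is compared against every cluster already chosen, so this sketch is far slower than the construction developed in Section 3.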

The greedy consensus tree is one of the most well-known consensus trees. It has been used in numerous papers such as [6, 23, 14, 18, 3, 24, 29, 19, 2, 15, 16, 26, 33] to name a few. For example, in a recent landmark paper in Nature [23], it was used to construct the species tree from 1000 gene trees of yeast genomes, and in [6] it was asserted that “The greedy consensus tree offers some robustness to gene-tree discordance that may cause other methods to fail to recover the species tree. In addition, the greedy consensus method outperformed our other methods for branch lengths outside the too-greedy zone.”.

The greedy consensus tree is an extension of the majority-rule consensus tree, and is sometimes called the extended majority-rule consensus (eMRC) tree. It is implemented in popular phylogenetics software packages like PHYLIP [9], PAUP* [32], MrBayes [22], and RAxML [28]. A naive construction of the greedy consensus tree requires $\mathcal{O}(kn^{3})$ time [5]. To speed this up, these software packages use some form of randomization. For example, PHYLIP uses hashing to improve the running time. Even with randomization, the time complexities of these solutions are not known. Recently, Jansson et al. [13] gave the best known provable construction with an $\mathcal{O}(kn^{2})$ deterministic running time (their implementation is also part of the FACT package). In Section 3 we present our main contribution, a deterministic $\tilde{\mathcal{O}}(kn^{1.5})$ construction as stated by the following theorem:

Theorem 2.

The greedy consensus tree of $k$ phylogenetic trees $T_{1},T_{2},\ldots,T_{k}$ on the same set of leaves $[n]$ can be computed in $\mathcal{O}(kn^{1.5}\log n)$ time.

To prove the above theorem, we develop a generic procedure that takes any ordered list of clusters $S_{1},S_{2},\ldots,S_{\ell}\subseteq[n]$ and tries adding them one-by-one to the current solution $\mathcal{C}$. We assume that every cluster $S_{i}$ is specified by providing a tree $T_{i}$ and a node $u_{i}\in T_{i}$ such that $S_{i}=\mathsf{L}(u_{i})$. Our procedure requires $\mathcal{O}(n^{0.5}\log n)$ time per cluster (to add this cluster to $\mathcal{C}$ or assert that it cannot be added) and need not assume anything about the order of the clusters. In particular, it does not rely on the clusters being sorted by frequencies.

2 Computing the Identifiers

We process the nodes of every $T_{i}$ in the bottom-up order. For every node $u\in T_{i}$ , we compute the identifier $\mathsf{id}(u)$ by updating the following structure called the dynamic set equality structure:

Lemma 1.

There exists a dynamic set equality structure that supports: (1) create a new empty structure in constant time, (2) add $x\in[n]$ to the current set in $\mathcal{O}(\log^{2}n)$ time, (3) return the identifier of the current set in constant time, and (4) list all $\ell$ elements of the current set in $\mathcal{O}(\ell)$ time. The structure ensures that the identifiers are bounded by the total number of update operations performed so far, and that two sets are equal iff their identifiers are equal.

Proof.

To allow for listing all elements of the current set $S$ , we store them in a list. Before adding the new element $x$ to the list, we need to check if $x\in S$ . This will be done using the representation described below.

Conceptually, we work with a complete binary tree $B$ on $n$ leaves labelled with $0,1,\ldots,n-1$ when read from left to right (without loss of generality, $n$ is a power of two), where every node $u$ corresponds to a set $D(u)\subseteq[n]$ defined by the leaves in its subtree (note that $D(u)=\{i,i+1,\ldots,j\}$, where $0\leq i\leq j<n$). Now, any set $S$ is associated with a binary tree $B$, where we write $1$ in a leaf if the corresponding element belongs to $S$ and $0$ otherwise. Then, for every node we define its characteristic vector by writing down the values written in the leaves of its subtree in the natural order (from left to right). Clearly, the vector of an inner node is obtained by concatenating the vectors of its children. We want to maintain identifiers of all nodes, so that the identifiers of two nodes are equal iff their characteristic vectors are identical. If we can keep the identifiers small, then the identifier of the current set can be computed as the identifier of the root of $B$.

Assume that we have already computed the identifiers of all nodes in $B$ and now want to add $x$ to $S$ . This changes the value in the leaf $u$ corresponding to $x$ and, consequently, the characteristic vectors of all ancestors of $u$ . However, it does not change the characteristic vectors of any other node. Therefore, we traverse the ancestors of $u$ starting from $u$ and recompute their identifiers. Let $v$ be the current node. If we have never seen the characteristic vector of $v$ before, we can set the identifier of $v$ to be the largest already used identifier plus one. Otherwise, we have to set the identifier of $v$ to be the same as the one previously used for a node with such a characteristic vector. As mentioned above, the characteristic vector of an inner node $v$ is the concatenation of the characteristic vectors of its children $v_{\ell}$ and $v_{r}$ . We maintain a dictionary mapping a pair consisting of the identifier of $v_{\ell}$ and the identifier of $v_{r}$ to the identifier of $v$ . The dictionary is global, that is, shared by all instances of the structure. Then, assuming that we have already computed the up-to-date identifiers of $v_{\ell}$ and $v_{r}$ , we only need to query the dictionary to check if the identifier of $v$ should be set to the largest already used identifier plus one (which is exactly when the dictionary does not contain the corresponding pair) or retrieve the appropriate identifier. Therefore, adding $x$ to $B$ reduces to $\log n$ queries to the dictionary. By implementing the dictionary with balanced search trees, we therefore obtain the claimed $\mathcal{O}(\log^{2}n)$ time for adding an element.

We are not completely done yet, because creating a new complete binary tree $B$ takes $\mathcal{O}(n)$ time and therefore the initialization time is not constant yet. However, we can observe that it does not make sense to explicitly maintain a node $u$ of $B$ such that $S\cap D(u)=\emptyset$, because we can assume that the identifier of such a $u$ is 0. In other words, we can maintain only the part of $B$ induced by the leaves corresponding to $S$. Adding an element $x$ to $S$ is implemented as above, except that we might need to create (at most $\mathcal{O}(\log n)$) new nodes on the leaf-to-root path corresponding to $x$ (if such a leaf already exists, we terminate the procedure as $x\in S$ already) and then recompute the identifiers on the whole path as described above. ∎
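The structure from the proof can be sketched as follows. Two simplifications are assumptions of this example rather than part of the lemma: the global dictionary is a Python `dict` (hashing, so expected rather than worst-case bounds) instead of a balanced search tree, and nodes of $B$ are addressed by their half-open leaf interval `(lo, hi)`. The class and attribute names are hypothetical.

```python
class SetEquality:
    """Sparse version of the tree B: only nodes that meet the set are stored."""

    pair_to_id = {}  # global dictionary shared by all instances
    next_id = 2      # 0 = empty subtree, 1 = leaf holding an element

    def __init__(self, n):
        self.n = n          # universe size, assumed to be a power of two
        self.node_id = {}   # (lo, hi) interval of B -> identifier
        self.elements = []  # list of elements, for listing the set

    def identifier(self):
        """Identifier of the root of B, i.e. of the current set."""
        return self.node_id.get((0, self.n), 0)

    def add(self, x):
        if (x, x + 1) in self.node_id:
            return  # x already belongs to the set
        self.elements.append(x)
        self.node_id[(x, x + 1)] = 1
        lo, hi = x, x + 1
        cls = SetEquality
        while hi - lo < self.n:
            size = hi - lo
            # Move to the parent interval in the complete binary tree.
            if (lo // size) % 2 == 0:
                plo, phi = lo, hi + size
            else:
                plo, phi = lo - size, hi
            mid = plo + size
            key = (self.node_id.get((plo, mid), 0),
                   self.node_id.get((mid, phi), 0))
            if key not in cls.pair_to_id:
                cls.pair_to_id[key] = cls.next_id
                cls.next_id += 1
            self.node_id[(plo, phi)] = cls.pair_to_id[key]
            lo, hi = plo, phi
```

Two instances over the same universe report the same `identifier()` exactly when they hold the same elements, regardless of insertion order.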

Armed with Lemma 1, we process every $T_{i}$ bottom-up. Consider an inner node $v\in T_{i}$ and let $v_{1},v_{2},\ldots,v_{d}$ be its children ordered so that $|\mathsf{L}(v_{1})|=\max_{j}|\mathsf{L}(v_{j})|$, that is, the subtree rooted at $v_{1}$ is the largest. Assuming that we have already stored every $\mathsf{L}(v_{j})$ in a dynamic set equality structure, we construct a dynamic set equality structure storing $\mathsf{L}(v)$ by simply inserting all elements of $\mathsf{L}(v_{2})\cup\mathsf{L}(v_{3})\cup\cdots\cup\mathsf{L}(v_{d})$ into the structure of $\mathsf{L}(v_{1})$. This takes $\mathcal{O}(\log^{2}n)$ time per element. Then, we set $\mathsf{id}(v)$ to be the identifier of the obtained structure. By a standard argument (heavy path decomposition), every leaf of $T_{i}$ is inserted into at most $\log n$ structures and therefore the whole $T_{i}$ is processed in $\mathcal{O}(n\log^{3}n)$ time. This gives $\mathcal{O}(kn\log^{3}n)$ total time.
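The bottom-up merging into the largest child can be sketched as below. To keep the example short, the dynamic set equality structure is replaced by a stand-in: a shared `registry` dictionary keyed by `frozenset`, which is simple but not efficient. The tree encoding (`children`, `leaf_label`) and all names are assumptions for illustration. The returned counter shows how many element insertions the merging performs.

```python
def assign_ids(children, leaf_label, root, registry):
    """Assign identifiers to all nodes of one tree.

    children: node -> list of child nodes (leaves are absent or map to []).
    leaf_label: leaf -> label.
    registry: dict shared by all trees, frozenset -> identifier
              (stand-in for the dynamic set equality structure).
    Returns (ids, number of element insertions performed).
    """
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))

    ids, sets, inserts = {}, {}, 0
    for v in reversed(order):  # children are handled before their parent
        kids = children.get(v, [])
        if not kids:
            sets[v] = {leaf_label[v]}
            inserts += 1
        else:
            big = max(kids, key=lambda c: len(sets[c]))
            merged = sets.pop(big)  # reuse the structure of the largest child
            for c in kids:
                if c != big:
                    small = sets.pop(c)
                    merged.update(small)
                    inserts += len(small)
            sets[v] = merged
        ids[v] = registry.setdefault(frozenset(sets[v]), len(registry))
    return ids, inserts
```

Because an element is only ever moved from a smaller set into a set at least as large, the set containing it at least doubles with each move, which is the source of the logarithmic bound on insertions per leaf.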

We now proceed with a faster $\mathcal{O}(kn\log^{2}n)$ total time solution. While this is irrelevant for our $\mathcal{O}(kn^{1.5}\log n)$ time construction of the greedy consensus tree, it implies a better complexity for constructing the frequency difference consensus tree.

We start with a high-level intuition. Lemma 1 is, in a sense, more than we need, as it is not completely clear that we need to immediately compute the identifier of the current set. Indeed, applying heavy path decomposition we can partially delay computing the identifiers by proceeding in $\mathcal{O}(\log n)$ phases. In each phase, we can then replace the dynamic dictionary used to store the mapping with a radix sort. Intuitively, this shaves one log from the time complexity. We proceed with a detailed explanation.

Theorem 3.

The numbers $\mathsf{id}(u)$ can be found for all nodes of the $k$ phylogenetic trees $T_{1},T_{2},\ldots,T_{k}$ in $\mathcal{O}(kn\log^{2}n)$ total time.

Proof.

For a node $v\in T_{i}$, define its level $\mathsf{level}(v)$ to be $\ell$, such that $2^{\ell}\leq|\mathsf{L}(v)|<2^{\ell+1}$. Thus, the levels are between $0$ and $\log n$, the level of a node is at least as large as the levels of its children, and a node on level $\ell$ has at most one child on the same level. We work in phases $\ell=0,1,\ldots,\log n$. In phase $\ell$, we assume that the numbers $\mathsf{id}(v)$ are already known for all nodes $v$, such that $\mathsf{level}(v)<\ell$, and want to assign these numbers to all nodes $v$, such that $\mathsf{level}(v)=\ell$. We will show how to achieve this in $\mathcal{O}(kn\log n)$ time, thus proving the theorem.

Consider all nodes $v$ , such that $\mathsf{level}(v)=\ell$ . Because every such $v$ has at most one child at the same level, all level- $\ell$ nodes in $T_{i}$ can be partitioned into maximal paths of the form $p=v_{1}-v_{2}-\ldots-v_{s}$ , where the level of the parent of $v_{1}$ is larger than $\ell$ (or $v_{1}$ is the root of $T_{i}$ ), and the levels of all children of $v_{j}$ (except for $v_{j+1}$ , if defined) are smaller than $\ell$ . $v_{1}$ is called the head of $p$ and denoted $\mathsf{head}(p)$ . Now, our goal is to find $\mathsf{id}(v_{j})$ with the required properties for every $j=1,2,\ldots,s$ . We will actually achieve a bit more. The sets $\mathsf{L}(\mathsf{head}(p))$ are disjoint in every tree $T_{i}$ , and thus we can define, for every $i$ , a partition $\mathcal{P}_{i}=\{P_{i}(1),P_{i}(2),\ldots,P_{i}(t_{i})\}$ of the set of leaves $[n]$ , where every $P_{i}(z)$ corresponds to a level- $\ell$ path $p=v_{1}-v_{2}-\ldots-v_{s}$ in $T_{i}$ , such that $\mathsf{L}(\mathsf{head}(p))=P_{i}(z)$ . The elements of $P_{i}(z)$ are then ordered, and we think that $P_{i}(z)$ is a sequence of length $|P_{i}(z)|$ . The ordering is chosen so that, for every $j=1,2,\ldots,s$ , the set $\mathsf{L}(v_{j})$ corresponds to some prefix of $P_{i}(z)$ . $P_{i}(z)[1..r]$ denotes the prefix of $P_{i}(z)$ of length $r$ . We will assign identifiers to all such prefixes $P_{i}(z)[1..r]$ , for every $i=1,2,\ldots,k$ , $z=1,2,\ldots,t_{i}$ and $r=1,2,\ldots,|P_{i}(z)|$ , with the property that the identifiers of two prefixes are equal iff the sets of leaves appearing in both of them are equal. Then, we can extract the required $\mathsf{id}(v_{j})$ in constant time each by taking the identifiers of some $P_{i}(z)[1..r]$ .

Recall that in the slower solution we worked with a complete binary tree $B$ on $n$ leaves. For every set $S$ in the collection and every $u\in B$ , we computed an identifier of the set $S\cap D(u)$ . This was possible, because if $u_{\ell}$ and $u_{r}$ are the left and the right child of $u$ , respectively, then the identifier of $S\cap D(u)$ can be found using the identifiers of $S\cap D(u_{\ell})$ and $S\cap D(u_{r})$ . We need to show that retrieving these identifiers can be batched.

Fix a node $u\in B$ and, for every $i=1,2,\ldots,k$ and $z=1,2,\ldots,t_{i}$, consider all prefixes $P_{i}(z)[1..r]$ for $r=1,2,\ldots,|P_{i}(z)|$. We create a version of $u$ for every such prefix. The version corresponds to the set containing all elements of $D(u)$ occurring in the prefix $P_{i}(z)[1..r]$. We want to assign identifiers to all versions of $u$. First, observe that we only have to create a new version if $P_{i}(z)[r]\in D(u)$, as otherwise the set is the same as for $r-1$. Thus, the total number of required versions, when summed over all nodes $u\in B$ on the same depth in $B$, is only $kn$, as a leaf of $T_{i}$ creates exactly one new version for some $u$. For every node $u\in B$, we will store a list of all its versions. A version consists of its identifier (such that the identifier of two versions is the same iff the corresponding sets are equal) together with the indices $i$, $z$ and $r$. We describe how to create such a list for every node $u\in B$ at the same depth $d$ given the lists for all nodes at depth $d+1$ next.

Let $u_{1}$ and $u_{2}$ be the left and the right child of $u\in B$, respectively. Then, we need to create a new version of $u$ for every new version of $u_{1}$ and every new version of $u_{2}$, because for the set corresponding to $u$ to change either the set corresponding to $u_{1}$ or the set corresponding to $u_{2}$ must change, and every change is adding one new element. Fix $i$ and $z$ and consider all versions of $u_{1}$ corresponding to $i$ and $z$ sorted according to $r$. Let the sorted list of their $r$’s be $a_{1}<a_{2}<\ldots$. Similarly, consider all versions of $u_{2}$ corresponding to $i$ and $z$ sorted according to $r$, and let the sorted list of their $r$’s be $b_{1}<b_{2}<\ldots$. For every $x\in\{a_{1},a_{2},\ldots\}\cup\{b_{1},b_{2},\ldots\}$, we create a new version of $u$ corresponding to $i$, $z$, and $r$ equal to $x$. This is done by retrieving the version of $u_{1}$ with $r$ equal to $a_{p}$, such that $a_{p}\leq x$ and $p$ is maximized, and the version of $u_{2}$ with $r$ equal to $b_{q}$, such that $b_{q}\leq x$ and $q$ is maximized. Then, the identifier of the new version of $u$ can be constructed from the pair consisting of the identifiers of these versions of $u_{1}$ and $u_{2}$ (this is essentially the same reasoning as in the slower solution). We could now use a dictionary to map these pairs to identifiers. However, we can also observe that, in fact, we have reduced finding the identifiers of all versions of all nodes $u\in B$ at the same depth $d$ to identifying duplicates on a list of $kn$ pairs of numbers from $[kn]$. This can be done by radix sorting all pairs in linear time (more precisely, $\mathcal{O}(kn)$ time and $\mathcal{O}(kn)$ space), and then sweeping through the sorted list while assigning the identifiers. This takes only $\mathcal{O}(kn)$ time for every depth $d$, so $\mathcal{O}(kn\log n)$ for every level as claimed. ∎
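The final step of the proof, identifying duplicates among pairs by radix sort and a sweep, can be sketched in isolation. The function name and the input format (a list of integer pairs with both components below `universe`) are assumptions for the example; building the pairs from the version lists is not shown.

```python
def ids_from_pairs(pairs, universe):
    """Assign identifiers so that equal pairs get equal identifiers.

    pairs: list of tuples (a, b) with 0 <= a, b < universe.
    Uses a two-pass LSD radix sort (one stable counting pass per component)
    followed by a sweep over the sorted order.
    """
    def counting_pass(order, key):
        buckets = [[] for _ in range(universe)]
        for idx in order:
            buckets[key(idx)].append(idx)
        return [idx for bucket in buckets for idx in bucket]

    order = list(range(len(pairs)))
    order = counting_pass(order, lambda i: pairs[i][1])  # least significant
    order = counting_pass(order, lambda i: pairs[i][0])  # most significant

    ids = [0] * len(pairs)
    current, previous = -1, None
    for idx in order:
        if pairs[idx] != previous:  # a new distinct pair starts here
            current += 1
            previous = pairs[idx]
        ids[idx] = current
    return ids
```

The running time is linear in `len(pairs) + universe`, so no dictionary lookups are needed at this step.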

The proof of Theorem 1 follows immediately from Theorem 3.

3 Simulating the Greedy Algorithm

We consider $k$ trees $T_{1},\ldots,T_{k}$ on the same set of leaves $[n]$, and assume that every node $u$ has an identifier $\mathsf{id}(u)$ such that $\mathsf{id}(u)=\mathsf{id}(u^{\prime})$ iff $\mathsf{L}(u)=\mathsf{L}(u^{\prime})$. We next develop a general method for maintaining a solution $\mathcal{C}$ (i.e., a set of compatible identifiers) so that, given any node $u\in T_{i}$, we are able to efficiently check if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$, meaning that $\mathcal{C}\cup\{\mathsf{L}(u)\}$ is consistent, and if so add $\mathsf{L}(u)$ to $\mathcal{C}$. Our method does not rely on the order in which the sets arrive and in particular can be used to run the greedy algorithm.

We represent $\mathcal{C}$ with a phylogenetic tree $T_{c}$ such that $\mathcal{C}=\{\mathsf{L}(u):u\in T_{c}\}$. $T_{c}$ is called the current consensus tree. By Lemma 2.2 of [13], a cluster $S$ is compatible with $\mathcal{C}$ iff there exists a node $v\in T_{c}$ such that for every child $v^{\prime}$ of $v$ either $\mathsf{L}(v^{\prime})\cap S=\emptyset$ or $\mathsf{L}(v^{\prime})\subseteq S$. Also, adding $S=\mathsf{L}(u)$ to $\mathcal{C}$ can be done by creating a new child $w$ of $v$ and reconnecting every original child $v^{\prime}$ of $v$ such that $\mathsf{L}(v^{\prime})\subseteq S$ to the new $w$. This is illustrated in Figure 1.

Initially, $T_{c}$ consists only of $n$ leaves attached to the common root (which corresponds to $\mathcal{C}=\{\{x\}:x\in[n]\}$). Our goal is to maintain some additional information so that given any node $u\in T_{i}$, we can check if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ in $\mathcal{O}(n^{0.5}\log n)$ time. After adding $\mathsf{L}(u)$ to $\mathcal{C}$ the information will be updated in amortized $\mathcal{O}(kn^{0.5}\log n)$ time. To explain the intuition, we first show how to check if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ in roughly $\mathcal{O}(|\mathsf{L}(u)|)$ time.

Let $\mathsf{L}(u)=\{\ell_{1},\ell_{2},\ldots,\ell_{s}\}$ and let $u_{i}$ be the leaf of $T_{c}$ labelled with $\ell_{i}$ . Then, $v$ must be an ancestor of every $u_{i}$ . We claim that, in fact, $v$ should be chosen as the lowest common ancestor of $u_{1},u_{2},\ldots,u_{s}$ , because if all $u_{i}$ ’s are in the same subtree rooted at a child $v^{\prime}$ of $v$ then we can as well replace $v$ with $v^{\prime}$ . So, we can find $v$ by asking $s-1$ lca queries: we start with $u_{1}$ and then iteratively jump to the lca of the current node and $u_{i}$ . Assuming that we represent $T_{c}$ in such a way that an lca query can be answered efficiently, this takes roughly $\mathcal{O}(s)$ time. Then, we need to decide if for every child $v^{\prime}$ of $v$ it holds that $\mathsf{L}(v^{\prime})\subseteq\mathsf{L}(u)$ or $\mathsf{L}(v^{\prime})\cap\mathsf{L}(u)=\emptyset$ . This can be done by computing, for every such $v^{\prime}$ , how many $u_{i}$ ’s belong to the subtree rooted at $v^{\prime}$ , and then checking if this number is either 0 or $|\mathsf{L}(v^{\prime})|$ . To compute these numbers, we maintain a counter for every $v^{\prime}$ . Then, for every $u_{i}$ we retrieve the child $v^{\prime}$ of $v$ such that $u_{i}$ belongs to the subtree rooted at $v^{\prime}$ and increase the counter of $v^{\prime}$ . Assuming that we represent $T_{c}$ so that such $v^{\prime}$ can be retrieved efficiently, this again takes roughly $\mathcal{O}(s)$ time. Finally, we iterate over all $u_{i}$ again, retrieve the corresponding $v^{\prime}$ and check if its counter is equal to $|\mathsf{L}(v^{\prime})|$ (so our representation of $T_{c}$ should also allow retrieving the number of leaves in a subtree). If not, then $\mathsf{L}(u)$ is not compatible with $\mathcal{C}$ , see Figure 2. Otherwise, we create the new node $w$ and reconnect to $w$ all children $v^{\prime}$ of $v$ , such that the counter of $v^{\prime}$ is equal to $|\mathsf{L}(v^{\prime})|$ .
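The counter-based check just described can be sketched as follows. The sketch deliberately uses naive parent-pointer walks for lca and child retrieval, so it does not achieve the stated running time; it only shows the logic. The class name, the integer node numbering (leaves `0..n-1`, root `n`), and the treatment of clusters that are already present are assumptions of the example.

```python
class CurrentTree:
    """T_c with naive parent-pointer navigation (no efficient lca support)."""

    def __init__(self, n):
        self.parent = {x: n for x in range(n)}  # leaves hang off the root n
        self.parent[n] = None
        self.children = {n: set(range(n))}
        self.leaves = {x: 1 for x in range(n)}  # leaf count per subtree
        self.leaves[n] = n
        self.next_node = n + 1

    def lca(self, a, b):
        seen = set()
        while a is not None:
            seen.add(a)
            a = self.parent[a]
        while b not in seen:
            b = self.parent[b]
        return b

    def child_towards(self, v, x):
        """Child of v whose subtree contains the proper descendant x."""
        while self.parent[x] != v:
            x = self.parent[x]
        return x

    def try_add(self, cluster):
        """Add the cluster if compatible; return whether it is compatible."""
        cluster = list(cluster)
        if len(cluster) == 1:
            return True  # singletons are always present
        v = cluster[0]
        for x in cluster[1:]:
            v = self.lca(v, x)
        counter = {}
        for x in cluster:
            c = self.child_towards(v, x)
            counter[c] = counter.get(c, 0) + 1
        if any(cnt != self.leaves[c] for c, cnt in counter.items()):
            return False  # some child of v is only partially covered
        if len(counter) == len(self.children[v]):
            return True  # the cluster equals L(v) and is already present
        w = self.next_node
        self.next_node += 1
        self.parent[w] = v
        self.children[w] = set(counter)
        self.leaves[w] = len(cluster)
        for c in counter:
            self.parent[c] = w
        self.children[v] -= set(counter)
        self.children[v].add(w)
        return True
```

For example, after adding `{0, 1}` and `{0, 1, 2}` over four leaves, the cluster `{1, 2}` is rejected because the child holding `{0, 1}` would be only half covered.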

We would like to avoid explicitly iterating over all elements of $\mathsf{L}(u)$. This will be done by maintaining some additional information, so that we only have to iterate over up to $n^{0.5}$ elements. To describe this additional information, we need the (standard) notion of a micro-macro decomposition. Let $b$ be a parameter and consider a binary tree on $n$ nodes. We want to partition it into $\mathcal{O}(n/b)$ node-disjoint subtrees called micro trees. Each micro tree is of size at most $b$ and contains at most two boundary nodes that are adjacent to nodes in other micro trees. One of these boundary nodes, called the top boundary node, is the root of the whole micro tree, and the other is called the bottom boundary node. Such a partition is always possible and can be found in $\mathcal{O}(n)$ time.

We binarize every $T_{i}$ to obtain $T^{\prime}_{i}$ . Then, we find a micro-macro decomposition of $T^{\prime}_{i}$ with $b=n^{0.5}$ . By properties of the decomposition we have the following:

Proposition 1.

For any $u\in T_{i}$ such that $|\mathsf{L}(u)|>n^{0.5}$ , there exists a boundary node $v\in T^{\prime}_{i}$ such that $\mathsf{L}(u)$ can be obtained by adding at most $n^{0.5}$ elements to $\mathsf{L}(v)$ . Furthermore, $v$ and these up to $n^{0.5}$ elements can be retrieved in $\mathcal{O}(n^{0.5})$ time after $\mathcal{O}(n)$ preprocessing.

The total number of boundary nodes is only $\mathcal{O}(kn^{0.5})$ . For each such boundary node $u$ , we maintain a pointer to a node $\mathsf{finger}(u)\in T_{c}$ called the finger of $u$ . $\mathsf{finger}(u)$ is a node $v\in T_{c}$ such that $\mathsf{L}(u)\subseteq\mathsf{L}(v)$ but, for every child $v_{i}$ of $v$ , $\mathsf{L}(u)\not\subseteq\mathsf{L}(v_{i})$ .

Proposition 2.

The node $\mathsf{finger}(u)$ is the lowest common ancestor in $T_{c}$ of all leaves with labels belonging to $\mathsf{L}(u)$ .

Additionally, the children of $v=\mathsf{finger}(u)$ are partitioned into three groups: (1) $v_{i}$ such that $\mathsf{L}(v_{i})\subseteq\mathsf{L}(u)$, (2) $v_{i}$ such that $\mathsf{L}(v_{i})\cap\mathsf{L}(u)=\emptyset$, and (3) the rest. We call them full, empty, and mixed, respectively (with respect to $u$). For each group we maintain a list storing all nodes in the group, every node knows its group, and the group knows its size. Additionally, every group knows the total number of leaves in all subtrees rooted at its nodes.
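To make the finger and the three groups concrete, here is a naive sketch that recomputes them from scratch for a given cluster, rather than maintaining them incrementally as the algorithm requires. The tree encoding (a dict from inner node to its list of children, with leaves identified with their labels) and the function names are assumptions for illustration.

```python
def leaf_set(tree, v):
    """Labels of all leaves below v. Leaves are absent from the dict `tree`."""
    kids = tree.get(v, [])
    if not kids:
        return {v}
    return set().union(*(leaf_set(tree, c) for c in kids))


def finger(tree, root, cluster):
    """Lowest node of the tree whose leaf set contains the whole cluster."""
    v = root
    while True:
        inside = [c for c in tree.get(v, []) if cluster <= leaf_set(tree, c)]
        if not inside:
            return v
        v = inside[0]


def groups(tree, v, cluster):
    """Split the children of v into full, empty and mixed w.r.t. the cluster."""
    full, empty, mixed = [], [], []
    for c in tree.get(v, []):
        leaves = leaf_set(tree, c)
        if leaves <= cluster:
            full.append(c)
        elif leaves.isdisjoint(cluster):
            empty.append(c)
        else:
            mixed.append(c)
    return full, empty, mixed
```

With these groups, a cluster is compatible with the tree exactly when the mixed group of its finger is empty, which matches the criterion quoted from [13].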

We also need to augment the representation $T_{c}$ to allow for efficient extended lca queries. The lowest common ancestor (lca) of $u$ and $v$ is the leafmost node $w$ that is an ancestor of both $u$ and $v$ . An extended lca query, denoted $\mathsf{lca\_ext}(u,v)$ , returns the first edge on the path from the lca of $u$ and $v$ to $u$ , and -1 if $u$ is an ancestor of $v$ . For example, in Figure 2, $\mathsf{lca\_ext}(v,k)=-1$ whereas $\mathsf{lca\_ext}(n,k)$ is the edge between $v$ and its leftmost child.
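The semantics of an extended lca query can be pinned down with a naive parent-pointer sketch. It runs in time proportional to the depth of the tree, unlike the logarithmic bound obtained below with link/cut trees. Representing an edge as a `(parent, child)` tuple, and treating `u == v` as the ancestor case, are assumptions of this example.

```python
def lca_ext(parent, u, v):
    """Extended lca query on a tree given by parent pointers.

    parent: dict node -> parent, with the root mapped to None.
    Returns the first edge (lca, child) on the path from lca(u, v) down
    to u, or -1 if u is an ancestor of v.
    """
    ancestors_of_v = set()
    x = v
    while x is not None:
        ancestors_of_v.add(x)
        x = parent[x]
    if u in ancestors_of_v:
        return -1
    child = u
    while parent[child] not in ancestors_of_v:
        child = parent[child]
    return (parent[child], child)
```

The lower endpoint of the returned edge is exactly the child of the lca whose subtree contains `u`, which is what the counter-based compatibility check needs.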

Lemma 2.

We can maintain a collection of rooted trees under: (1) create a new tree consisting of a single node, (2) make the root of one tree a child of a node in another tree, (3) delete an edge from a node to its parent, (4) count leaves in the tree containing a given node, and (5) extended lca queries, all in $\mathcal{O}(\log n)$ amortized time, where $n$ is the total size of all trees in the collection.

Proof.

We apply the link/cut trees of Sleator and Tarjan [25] to maintain the collection. This immediately gives us the first three operations. To implement computing the size and $\mathsf{lca\_ext}(u,v)$ queries we need to explain the internals of link/cut trees. Each tree is partitioned into node-disjoint paths consisting of preferred edges. Each node has at most one such edge leading to its preferred child. For each maximal path consisting of preferred edges, called a preferred path, we store its nodes in a splay tree, where the left-to-right order on the nodes of the splay tree corresponds to the top-bottom order on the nodes in the rooted tree. Each such splay tree stores a pointer to the topmost node of its preferred path. Additionally, each node of the tree stores a pointer to its current parent. All operations on a link/cut tree use the access procedure. Given a node $v$ , its goal is to change the preferred edges so that there is a preferred path starting at the root and ending at $v$ . This is done by first shortening the preferred path containing $v$ so that it ends at $v$ . Then, we iteratively jump to the topmost node $u$ of the current preferred path and make $u$ the preferred child of its parent. Whenever the preferred child of a node changes, we need to update the splay tree representing the nodes of the preferred path. Even though the number of jumps might be $\Omega(n)$ , it can be shown that all these updates take $\mathcal{O}(\log n)$ amortized time.

Now we can explain how to implement $\mathsf{lca\_ext}(u,v)$ . First, we access node $v$ . This gives us a preferred path starting at the root and ending at $v$ . Second, we access node $u$ while keeping track of the topmost nodes of the visited preferred paths. If $u$ is on the same preferred path as $v$ , then $u$ is an ancestor of $v$ . Otherwise, let $p$ be the preferred path visited just before reaching the preferred path starting at the root of the whole tree. Then the topmost node of $p$ (before changing the preferred child of its parent) should be returned as $\mathsf{lca\_ext}(u,v)$ . Thus, the complexity of $\mathsf{lca\_ext}(u,v)$ is the same as the complexity of access.

To compute the size of a tree, we augment the splay trees. Every node of a preferred path stores the total number of leaves in all subtrees attached to it through non-preferred edges (plus one if the node itself is a leaf). Additionally, every node of a splay tree stores the sum of the numbers stored in its subtree, or in other words the total number of leaves in all subtrees attached to its corresponding contiguous fragment of the preferred path through non-preferred edges. The sums stored at the nodes of the splay tree are easily maintained during rotations. We also need to update the total number of leaves after making a preferred edge non-preferred or vice versa. This is easily done by accessing the sum stored at the root of the splay tree. To obtain the number of leaves in the tree containing $v$ , we access $v$ . This makes all of $v$ ’s children non-preferred and makes $v$ the root of its splay tree. Hence, the number stored at $v$ is the total number of leaves in the tree containing $v$ . ∎

We next show how to efficiently check for any $u$ if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ . By the following lemma, this can be done in $\mathcal{O}(n^{0.5}\log n)$ time, assuming we have stored the required additional information. Recall that this additional information includes:

1. The tree $T_{c}$ maintained using Lemma 2.

2. For every boundary node $w$ , we store $\mathsf{finger}(w)$ .

3. For every boundary node $w$ , we store three lists containing the full, the mixed, and the empty children of $\mathsf{finger}(w)$ respectively. Each list also stores the total number of leaves in all subtrees rooted at its nodes.

Lemma 3.

Assuming access to the above additional information, given any node $u\in T_{i}$ we can check if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ in $\mathcal{O}(n^{0.5}\log n)$ time.

Proof.

By Lemma 2.2 of [13], to check if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ we need to check if there exists a node $v$ such that for every child $v_{i}$ of $v$ either $\mathsf{L}(v_{i})\cap\mathsf{L}(u)=\emptyset$ or $\mathsf{L}(v_{i})\subseteq\mathsf{L}(u)$ . First, observe that $v$ can be chosen as the lowest common ancestor of all leaves with labels belonging to $\mathsf{L}(u)$ . By properties of the micro-macro decomposition, we can retrieve a boundary node $w$ and a set $S$ of up to $n^{0.5}$ labels such that $\mathsf{L}(u)=\mathsf{L}(w)\cup S$ (if $|\mathsf{L}(u)|\leq n^{0.5}$ , there is no $w$ and $S=\mathsf{L}(u)$ ). Then, the lowest common ancestor of all leaves with labels belonging to $\mathsf{L}(u)$ is the lowest common ancestor of $\mathsf{finger}(w)$ and all leaves with labels belonging to $S$ . Therefore, $v$ can be found with $|S|$ lca queries in $\mathcal{O}(n^{0.5}\log n)$ time. Second, to check if $\mathsf{L}(v_{i})\cap\mathsf{L}(u)=\emptyset$ or $\mathsf{L}(v_{i})\subseteq\mathsf{L}(u)$ for every child $v_{i}$ of $v$ we distinguish two cases:

If $v$ is a proper ancestor of $\mathsf{finger}(w)$ we can calculate $|\mathsf{L}(v_{i})\cap\mathsf{L}(u)|$ for every $v_{i}$ in $\mathcal{O}(|S|\log n)=\mathcal{O}(n^{0.5}\log n)$ time as follows. Every edge has its associated counter. We assume that all counters are set to zero before starting the procedure and will make sure that they are cleared at the end. First, we use an $\mathsf{lca\_ext}(\mathsf{finger}(w),v)$ query to access the edge leading to the subtree of $v$ containing $\mathsf{finger}(w)$ and set its counter to $|\mathsf{L}(w)|$ . Then, we iterate over all $\ell\in S$ , retrieve the leaf $x$ of $T_{c}$ labelled with $\ell$ , and use an $\mathsf{lca\_ext}(x,v)$ query to access the edge leading to the subtree of $v$ containing $x$ and increase its counter by one. Additionally, whenever we access an edge for the first time (in this particular query), we add it to a temporary list $Q$ . After having processed all $\ell\in S$ , we iterate over $(v,v_{i})\in Q$ and check if the counter of $(v,v_{i})$ is equal to the number of leaves in the subtree rooted at $v_{i}$ (which requires retrieving the number of leaves). If this condition holds for every $(v,v_{i})\in Q$ then $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ and furthermore, the nodes $v_{i}$ such that $(v,v_{i})\in Q$ are exactly the ones that should be reconnected. Finally, we iterate over the edges in $Q$ again and reset their counters.

If $v=\mathsf{finger}(w)$ the situation is a bit more complicated because we might not have enough time to explicitly iterate over all children of $v$ that should be reconnected. Nevertheless, we can use a very similar method. Every edge has its associated counter (again, we assume that the counters are set to zero before starting the procedure and will make sure that they are cleared at the end). We also need a global counter $g$ that is set to the total number of leaves in all subtrees rooted at either full or mixed children of $v$ , decreased by $|\mathsf{L}(w)|$ . $g$ can be initialized in constant time in the first step of the procedure due to the additional information stored with every list of children. Intuitively, $g$ is how many leaves not belonging to $\mathsf{L}(w)$ we still have to see to conclude that indeed $\mathsf{L}(v_{i})\cap\mathsf{L}(u)=\emptyset$ or $\mathsf{L}(v_{i})\subseteq\mathsf{L}(u)$ for every child $v_{i}$ of $v$ . We iterate over $\ell\in S$ and access the edge $(v,v_{i})$ leading to the subtree containing the leaf labelled with $\ell$ . We decrease $g$ by one and, if $v_{i}$ is an empty child of $v$ and this is the first time we have seen $v_{i}$ (in this query), then we add the number of leaves in the subtree rooted at $v_{i}$ to $g$ . If, after having processed all $\ell\in S$ , $g=0$ then we conclude that $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ . The whole process takes $\mathcal{O}(|S|\log n)=\mathcal{O}(n^{0.5}\log n)$ time. ∎
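The counting idea behind the compatibility test can be sketched without boundary nodes or link/cut trees. The sketch below processes every leaf of $\mathsf{L}(u)$ one by one, so it takes time proportional to $|\mathsf{L}(u)|$ times the depth of $T_{c}$ and does not achieve the bound of Lemma 3. The function names, the parent-pointer encoding, and the precomputed `leaf_count` table are assumptions made for illustration; a local dictionary plays the role of the per-edge counters and of the list $Q$, so no reset step is needed.

```python
def ancestors(parent, x):
    """Return the list x, parent(x), ..., root."""
    path = []
    while x is not None:
        path.append(x)
        x = parent.get(x)
    return path


def check_compatible(parent, leaf_count, leaves):
    """leaves: distinct leaf nodes of T_c whose labels form L(u).
    Returns (ok, v, touched), where v is the lca of the leaves and
    touched lists the children of v whose subtrees intersect L(u)."""
    if len(leaves) == 1:
        return True, leaves[0], []  # a singleton is always compatible
    # v is the lowest common ancestor of all leaves in L(u).
    common = ancestors(parent, leaves[0])
    for x in leaves[1:]:
        on_path = set(ancestors(parent, x))
        common = [a for a in common if a in on_path]
    v = common[0]
    # Count, for every child of v, how many leaves of L(u) lie below it.
    counter = {}
    for x in leaves:
        while parent[x] != v:
            x = parent[x]
        counter[x] = counter.get(x, 0) + 1
    # Every touched child must be fully covered by L(u).
    ok = all(counter[c] == leaf_count[c] for c in counter)
    return ok, v, sorted(counter)


# Hypothetical example tree: r -> x, y, f; x -> a, b; y -> c, d, e.
parent = {"r": None, "x": "r", "y": "r", "f": "r",
          "a": "x", "b": "x", "c": "y", "d": "y", "e": "y"}
leaf_count = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1, "f": 1,
              "x": 2, "y": 3, "r": 6}
result = check_compatible(parent, leaf_count, ["a", "b", "f"])
```

In this example the set consisting of the leaves `a`, `b`, `f` is compatible (the children `x` and `f` of the root are fully covered), whereas the set consisting of `a` and `c` is not, because it covers only part of the subtree of `x`.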

Before explaining the details of how to update the additional information, we present the intuition. Recall that adding $\mathsf{L}(u)$ to $\mathcal{C}$ is done by creating a new child $v^{\prime}$ of $v$ and reconnecting some children of $v$ to $v^{\prime}$ . Let the set of all children of $v$ be $C$ and the set of children that should be reconnected be $C_{r}$ . Note that if $|C_{r}|=1$ or $|C|=|C_{r}|$ then we do not have to change anything in $T_{c}$ . Otherwise, updating $T_{c}$ can be implemented using two different methods:

1. Delete edges from nodes in $C_{r}$ to $v$ . Create a new tree consisting of a single node $v^{\prime}$ and make it a child of $v$ . Then, make all nodes in $C_{r}$ children of $v^{\prime}$ .

2. Delete edges from nodes in $C\setminus C_{r}$ to $v$ . Delete the edge from $v$ to its parent $w$ . Create a new tree consisting of a single node $v^{\prime}$ and make it a child of $w$ . Then, make $v$ a child of $v^{\prime}$ and also make all nodes in $C\setminus C_{r}$ children of $v^{\prime}$ . See Figure 4.

Thus, by using $C_{r}$ or $C\setminus C_{r}$ , the number of operations can be either $\mathcal{O}(|C_{r}|)$ or $\mathcal{O}(|C|-|C_{r}|)$ . We claim that by choosing the cheaper option we can guarantee that the total time for modifying the link-cut tree representation of $T_{c}$ is $\mathcal{O}(n\log^{2}n)$ . Intuitively, every edge of the final consensus tree participates in $\mathcal{O}(\log n)$ operations, and there are at most $n$ such edges. This is formalized in the following lemma.
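The choice between the two methods can be sketched on a plain parent/children representation. This is only an illustration of which nodes are moved: it uses Python dictionaries and sets instead of link/cut trees, and the name `split_node`, the argument `new` (the identifier of the freshly created node), and the returned cost are assumptions. In the second method the new node is spliced between $v$ and its parent and adopts $C\setminus C_{r}$, so $v$ itself ends up representing the new cluster.

```python
def split_node(parent, children, v, C_r, new):
    """Split v so that some node has exactly C_r as its children.
    Moves the smaller of C_r and C \\ C_r; returns the number of moved nodes."""
    C_r = set(C_r)
    rest = children[v] - C_r
    if len(C_r) <= 1 or not rest:
        return 0  # nothing to change in the tree
    if len(C_r) <= len(rest):
        # Method 1: `new` becomes a child of v and adopts C_r.
        children[new] = set()
        parent[new] = v
        children[v].add(new)
        moved = C_r
    else:
        # Method 2: `new` is spliced between v and its parent, adopts the rest.
        w = parent.get(v)
        children[new] = {v}
        parent[new] = w
        if w is not None:
            children[w].remove(v)
            children[w].add(new)
        parent[v] = new
        moved = rest
    for c in moved:
        children[v].remove(c)
        children[new].add(c)
        parent[c] = new
    return len(moved)


def star():
    """Hypothetical example: a root r with five leaf children."""
    leaves = ["a", "b", "c", "d", "e"]
    parent = {"r": None, **{x: "r" for x in leaves}}
    children = {"r": set(leaves)}
    return parent, children


p1, c1 = star()
cost1 = split_node(p1, c1, "r", {"a", "b"}, "n1")            # method 1
p2, c2 = star()
cost2 = split_node(p2, c2, "r", {"a", "b", "c", "d"}, "n1")  # method 2
```

In the first call two nodes are moved under the new node; in the second call only the single node `e` is moved, and the new node becomes the parent of `r`.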

Lemma 4.

$\min\{|C_{r}|,|C|-|C_{r}|\}$ summed over all updates of $T_{c}$ is at most $n\log n$ .

Proof.

We assume that $2\leq|C_{r}|<|C|$ in every update, as otherwise there is nothing to change in $T_{c}$ . Then, there are at most $n$ updates, as each of them creates a new inner node and there are never any nodes with degree 1 in $T_{c}$ .

We bound the sum of $\min\{|C_{r}|,|C|-|C_{r}|\}$ by assigning credits to inner nodes of $T_{c}$ . During the execution of the algorithm, a node $u$ with $b$ siblings should have $\log b$ credits. Whenever we create a new inner node we need at most $\log n$ new credits, so the total number of allocated credits is at most $n\log n$ . It remains to argue that, whenever we create a new child $v^{\prime}$ of $v$ and reconnect some of its children, the original credits of the children of $v$ can be used to pay for the update and make sure that all children of $v$ and $v^{\prime}$ have enough credits after the update.

Denoting $x=|C_{r}|$ and $y=|C|-|C_{r}|$ , the cost of the update is $\min\{x,y\}$ . The total number of credits of all children of $v$ before the update is $(x+y)\log(x+y-1)$ . After the update, the number of credits of all children of $v$ is $(y+1)\log y\leq y\log y+\log n$ and the number of credits of all children of $v^{\prime}$ is $x\log(x-1)$ . Ignoring the $\log n$ new credits allocated to $v^{\prime}$ , the number of available credits is thus:

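$$(x+y)\log(x+y-1)-y\log y-x\log(x-1)=x\log\left(1+\frac{y}{x-1}\right)+y\log\left(1+\frac{x-1}{y}\right),$$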

which is at least $\min\{x,y\}$ for $x\geq 2$ , so enough to pay $\min\{|C_{r}|,|C|-|C_{r}|\}$ for the update. Hence, the sum is at most $n\log n$ . ∎
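The final inequality can be seen as follows, assuming logarithms to base 2 and integers $x\geq 2$ , $y\geq 1$ : if $y\leq x-1$ then $y\log(1+(x-1)/y)\geq y\log 2=y$ , and if $y\geq x$ then $x\log(1+y/(x-1))\geq x\log 2=x$ ; in both cases a single term already covers $\min\{x,y\}$ . The short script below is a numerical sanity check of the identity and of the inequality on a small range. The function names are hypothetical, and the check is not a substitute for the proof.

```python
import math


def credits_before_minus_after(x, y):
    """(x+y) log(x+y-1) - y log y - x log(x-1), with logarithms to base 2."""
    return ((x + y) * math.log2(x + y - 1)
            - y * math.log2(y)
            - x * math.log2(x - 1))


def credits_released(x, y):
    """x log(1 + y/(x-1)) + y log(1 + (x-1)/y), with logarithms to base 2."""
    return (x * math.log2(1 + y / (x - 1))
            + y * math.log2(1 + (x - 1) / y))


def check(limit=200):
    """Verify the identity and the lower bound min(x, y) for small x, y."""
    for x in range(2, limit + 1):
        for y in range(1, limit + 1):
            lhs = credits_before_minus_after(x, y)
            rhs = credits_released(x, y)
            if abs(lhs - rhs) > 1e-8:
                return False
            if rhs < min(x, y):
                return False
    return True


all_ok = check()
```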

Before presenting the whole update procedure, we need one more technical lemma.

Lemma 5.

The procedure for checking if $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ can be requested to return $C_{r}$ in $\mathcal{O}(|C_{r}|+n^{0.5})$ time or $C\setminus C_{r}$ in $\mathcal{O}(|C|-|C_{r}|+n^{0.5})$ time.

Proof.

By inspecting the proof of Lemma 3, we see that there are two cases depending on whether $v$ is a proper ancestor of $\mathsf{finger}(w)$ or not.

1. If $v$ is a proper ancestor of $\mathsf{finger}(w)$ then $C_{r}$ can be obtained from $Q$ . More precisely, for every $(v,v_{i})\in Q$ we add $v_{i}$ to $C_{r}$ in $\mathcal{O}(|C_{r}|)$ total time. We can also obtain $C\setminus C_{r}$ in $\mathcal{O}(|C|)=\mathcal{O}(|C\setminus C_{r}|+|S|)=\mathcal{O}(|C|-|C_{r}|+n^{0.5})$ time.

2. If $v=\mathsf{finger}(w)$ then, while iterating over $\ell\in S$ , if this is the first time we have seen $v_{i}$ then we add $v_{i}$ to $C_{r}$ . Additionally, we add all full children of $v$ to $C_{r}$ . Thus, $C_{r}$ can be generated in $\mathcal{O}(|C_{r}|)$ time. Similarly, $C\setminus C_{r}$ consists of all empty children of $v$ without the nodes $v_{i}$ seen when iterating over $\ell\in S$ , and so can be generated in $\mathcal{O}(|C\setminus C_{r}|+|S|)=\mathcal{O}(|C|-|C_{r}|+n^{0.5})$ time.

Thus, we can always generate $C_{r}$ in $\mathcal{O}(|C_{r}|+n^{0.5})$ time and $C\setminus C_{r}$ in $\mathcal{O}(|C|-|C_{r}|+n^{0.5})$ time. ∎

To add $\mathsf{L}(u)$ to $\mathcal{C}$ , we will need to iterate over either $C_{r}$ or $C\setminus C_{r}$ (depending on which is smaller). After paying an additional $\mathcal{O}(n^{0.5})$ time we can assume that we have access to a list of the elements in the appropriate set. The additional time sums up to $\mathcal{O}(n^{1.5})$ , because there can be only $n$ distinct new sets added to $\mathcal{C}$ .

Lemma 6.

If $\mathsf{L}(u)$ is compatible with $\mathcal{C}$ then, after adding $\mathsf{L}(u)$ to $\mathcal{C}$ and modifying $T_{c}$ we can update all additional information in amortized $\mathcal{O}(kn^{0.5}\log n)$ time assuming that we add $n$ such sets.

Proof.

Recall that $T_{c}$ is maintained using the data structure from Lemma 2, and adding $\mathsf{L}(u)$ to $\mathcal{C}$ is implemented by creating a new child $v^{\prime}$ of $v$ and reconnecting some of the children of $v$ to $v^{\prime}$ . $C$ is the set of all children of $v$ and $C_{r}$ is the set of children of $v$ that are reconnected to $v^{\prime}$ . If $|C_{r}|\leq|C|-|C_{r}|$ we iterate over $C_{r}$ and reconnect them one-by-one. If $|C_{r}|>|C|-|C_{r}|$ we iterate over $C\setminus C_{r}$ and reconnect them to a new node that is inserted between $v$ and its parent. To iterate over either $C_{r}$ or $C\setminus C_{r}$ , we extend the query procedure as explained in Lemma 5. This adds $\mathcal{O}(n^{0.5})$ to the time complexity, but then we can assume that the requested set can be generated in time proportional to its size. To unify the cases $|C_{r}|\leq|C|-|C_{r}|$ and $|C_{r}|>|C|-|C_{r}|$ , we think of $v$ as being replaced with two nodes $v^{\prime}$ and $v^{\prime\prime}$ , where $v^{\prime}$ is the parent of $v^{\prime\prime}$ . All nodes in $C_{r}$ become children of $v^{\prime\prime}$ while all nodes of $C\setminus C_{r}$ become children of $v^{\prime}$ . This is achieved by iterating over either $C_{r}$ or $C\setminus C_{r}$ , depending on which set is smaller. By Lemma 4, in the whole process we iterate over sets of total size at most $n\log n$ , which is amortized $\mathcal{O}(\log n)$ per set, assuming that we add $n$ sets $\mathsf{L}(u)$ .

Consider a boundary node $u$ . If $\mathsf{finger}(u)\neq v$ then there is no need to update the additional information concerning $u$ . If $\mathsf{finger}(u)=v$ then we need to decide if the finger of $u$ should be set to $v^{\prime}$ or $v^{\prime\prime}$ and update the partition of the children of $\mathsf{finger}(u)$ accordingly. $\mathsf{finger}(u)$ should be set to $v^{\prime\prime}$ exactly when, for every $w\in C\setminus C_{r}$ , $\mathsf{L}(w)\cap\mathsf{L}(u)=\emptyset$ or, in other words, all nodes in $C\setminus C_{r}$ are empty with respect to $u$ . The groups should be updated as follows:

1. If $\mathsf{finger}(u)$ is set to $v^{\prime\prime}$ then we should remove all nodes in $C\setminus C_{r}$ from the list of empty nodes with respect to $u$ (as they are no longer children of $\mathsf{finger}(u)$ ). Other groups remain unchanged.

2. If $\mathsf{finger}(u)$ is set to $v^{\prime}$ then we should remove all nodes in $C_{r}$ from the lists. Additionally, we need to insert $v^{\prime\prime}$ into the appropriate group: full if all nodes in $C_{r}$ were full, empty if all nodes in $C_{r}$ were empty, and mixed otherwise.

We need to show that all these conditions can be checked by either iterating over the nodes of $C_{r}$ or over the nodes of $C\setminus C_{r}$ , because we want to iterate over the smaller of these. This then guarantees that the amortized cost of updating the additional information for a boundary node is only $\mathcal{O}(\log n)$ , so amortized $\mathcal{O}(kn^{0.5}\log n)$ overall.

To check if all nodes in $C\setminus C_{r}$ are empty with respect to $u$ , we can either iterate over the nodes in $C\setminus C_{r}$ or iterate over all nodes in $C_{r}$ and check if all nodes in $C$ that are full or mixed in fact belong to $C_{r}$ (this is possible because we also keep the total number of full and mixed nodes in $C$ ). Thus, we can decide whether $\mathsf{finger}(u)$ should be set to $v^{\prime}$ or $v^{\prime\prime}$ .

If $\mathsf{finger}(u)$ is set to $v^{\prime}$ we need to decide where to put $v^{\prime\prime}$ . We only explain how to decide if all nodes in $C_{r}$ are full, as the procedure for empty is symmetric. We can either iterate over all nodes in $C_{r}$ and check that they are full or iterate over all nodes in $C\setminus C_{r}$ and check that all nodes in $C$ that are empty or mixed in fact belong to $C\setminus C_{r}$ (and thus do not belong to $C_{r}$ , so all nodes in $C_{r}$ are full). Finally, we add the number of leaves in the subtree rooted at $v^{\prime\prime}$ (extracted in $\mathcal{O}(\log n)$ time) to the appropriate sum.

It remains to describe how to remove all unnecessary nodes from the lists. Here we do not worry about having to iterate over the smaller set, because there are only $\mathcal{O}(n)$ new edges created during the whole execution of the algorithm, so we can afford to explicitly iterate over the nodes that should be removed, that is, over $C_{r}$ or $C\setminus C_{r}$ . For every removed node, we also subtract the number of leaves in its subtree (extracted in $\mathcal{O}(\log n)$ time) from the appropriate sum. Overall, this adds $\mathcal{O}(n\log n)$ per boundary node to the time complexity, so only amortized $\mathcal{O}(kn^{0.5}\log n)$ overall. ∎

Bibliography

[1] E. N. Adams III. Consensus techniques and the comparison of taxonomic trees. Systematic Zoology, 21(4):390–397, 1972.
[2] M. Bayzid, S. Mirarab, B. Boussau, and T. Warnow. Weighted statistical binning: enabling statistically consistent genome-scale phylogenetic analyses. PLOS One, page e0129183, 2015.
[3] M. S. Bayzid and T. J. Warnow. Naive binning improves phylogenomic analyses. Bioinformatics, 29(18):2277–2284, 2013.
[4] K. Bremer. Combinable component consensus. Cladistics, 6(4):369–372, 1990.
[5] D. Bryant. A classification of consensus methods for phylogenetics. In Bioconsensus, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 61, pages 163–184. 2003.
[6] J. H. Degnan, M. DeGiorgio, D. Bryant, and N. A. Rosenberg. Properties of consensus methods for inferring species trees from gene trees. Systematic Biology, 58(1):35–54, 2009.
[7] J. Dong, D. Fernández-Baca, F. R. McMorris, and R. C. Powers. Majority-rule (+) consensus trees. Mathematical Biosciences, 228(1):10–15, 2010.
[8] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, Massachusetts, 2004.
