A Faster Construction of Greedy Consensus Trees
Paweł Gawrychowski, Gad M. Landau, Wing-Kin Sung, Oren Weimann

TL;DR
This paper presents significantly faster algorithms for constructing greedy and frequency difference consensus trees, reducing computational complexity from quadratic to near-linear time in key parameters, thereby improving phylogenetic analysis efficiency.
Contribution
The paper introduces improved algorithms that reduce the running time for computing greedy and frequency difference consensus trees from quadratic to near-linear time complexities.
Findings
Greedy consensus tree algorithm improved to O(k n^{1.5})
Frequency difference consensus tree algorithm improved to O(k n)
Significant reduction in computational complexity for phylogenetic consensus methods
Paweł Gawrychowski
University of Haifa, Israel
Gad M. Landau
University of Haifa, Israel
Wing-Kin Sung
National University of Singapore, Singapore
Oren Weimann
University of Haifa, Israel
Abstract
A consensus tree is a phylogenetic tree that captures the similarity between a set of conflicting phylogenetic trees. The problem of computing a consensus tree is a major step in phylogenetic tree reconstruction. It also finds applications in predicting a species tree from a set of gene trees. This paper focuses on two of the most well-known and widely used consensus tree methods: the greedy consensus tree and the frequency difference consensus tree. Given $k$ conflicting trees each with $n$ leaves, the previous fastest algorithms for these problems were $O(kn^2)$ for the greedy consensus tree [J. ACM 2016] and $\tilde{O}(\min\{kn^2, k^2n\})$ for the frequency difference consensus tree [ACM TCBB 2016]. We improve these running times to $\tilde{O}(kn^{1.5})$ and $\tilde{O}(kn)$ respectively.
1 Introduction
A phylogenetic tree describes the evolutionary relationships among a set of $n$ species called taxa. It is an unordered rooted tree whose leaves represent the taxa and whose inner nodes represent their common ancestors. Each leaf has a distinct label from $[n] = \{1, 2, \ldots, n\}$. The inner nodes are unlabeled and have at least two children.
Numerous phylogenetic trees, reconstructed from data sources like fossils or DNA sequences, have been published in the literature since the early 1860s. However, the phylogenetic trees obtained from different data sources or using different reconstruction methods often conflict (they are similar though not identical phylogenetic trees over the same set of leaf labels). The conflicts between phylogenetic trees are usually measured by their difference in signatures: the signature of a phylogenetic tree $T$ is the set $\{L(u) : u \in T\}$, where $L(u)$ denotes the set of labels of all leaves in the subtree rooted at node $u$ of $T$ (the set $L(u)$ is sometimes called a cluster). To deal with the conflicts between phylogenetic trees in a systematic manner, the concept of a consensus tree was invented. Informally, the consensus tree is a single phylogenetic tree that summarizes the branching structure (signatures) of all the conflicting trees. Consensus trees have been widely used in two applications:
1. Constructing a phylogenetic tree: First, by sampling the dataset, we generate $k$ different datasets (for some constant $k$, which can be large). Then, we reconstruct one phylogenetic tree for each dataset. Finally, we build the consensus tree of these $k$ trees.
2. Constructing a species tree: First, a phylogenetic tree (called a gene tree) is reconstructed for each individual gene. Then, the species tree is created by building the consensus tree of all gene trees.
Many different types of consensus trees have been proposed in the literature. For almost all of them, optimal or near-optimal time constructions are known. These include the Adams consensus tree [1], strict consensus tree [27], loose consensus tree [4, 13], majority-rule consensus tree [17, 13], majority-rule (+) consensus tree [11], and asymmetric median consensus tree [20, 21] (constructing the asymmetric median consensus tree was proven to be NP-hard for $k > 2$ [20] and solvable in polynomial time for $k = 2$ [21]). Two of the most notable exceptions are the frequency difference consensus tree [10] and the greedy consensus tree [5, 9], whose running times remain quadratic in either $k$ or $n$. In particular, the former can be constructed in $\tilde{O}(\min\{kn^2, k^2n\})$ time [11] and the latter in $O(kn^2)$ time [13]. For more details about different consensus trees and their advantages and disadvantages see the survey in [5], Chapter 30 in [8], and Chapter 8.4 in [31].
In this paper we propose novel algorithms for the frequency difference consensus tree problem and the greedy consensus tree problem. First, we present an $O(kn\log^2 n)$ time deterministic labeling method. The labeling method counts the frequency (number of occurrences) of every cluster in the input trees. Based on this labeling method, we obtain an $O(kn\log^2 n)$ time construction of the frequency difference consensus tree. Then, for the greedy consensus tree, we present our main technical contribution: a method that uses micro-macro decomposition to verify if a cluster $C$ is compatible with a tree $T$ in $\tilde{O}(\sqrt{n})$ time and, if so, modify $T$ to include $C$ in $\tilde{O}(\sqrt{n})$ amortized time. Using this procedure, we obtain an $\tilde{O}(kn^{1.5})$ time construction of the greedy consensus tree.
The frequency difference consensus tree.
The frequency of a cluster $C$ (a set of labels of all leaves in some subtree) is the number of trees that contain $C$. A cluster is said to be compatible with another cluster if they are either disjoint or one is included in the other. A frequent cluster is a cluster that occurs in more trees than any of the clusters that are incompatible with it. The frequency difference consensus tree is a tree whose signature is exactly all the frequent clusters.
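The definitions above can be turned into a direct, deliberately naive computation. The following Python sketch is only an illustration of the definitions, not the algorithm of this paper: it assumes trees are given as nested tuples with integer leaf labels, and the names `clusters`, `compatible`, and `frequent_clusters` are hypothetical.

```python
from collections import Counter

def clusters(tree):
    """Return all clusters of a tree given as nested tuples (an int is a leaf label)."""
    found = set()

    def visit(node):
        if isinstance(node, int):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(visit(child) for child in node))
        found.add(cluster)
        return cluster

    visit(tree)
    return found

def compatible(a, b):
    """Two clusters are compatible if they are disjoint or one contains the other."""
    return a.isdisjoint(b) or a <= b or b <= a

def frequent_clusters(trees):
    """Clusters occurring in more trees than every cluster incompatible with them."""
    freq = Counter()
    for tree in trees:
        freq.update(clusters(tree))  # each cluster is counted once per tree
    return {
        c for c in freq
        if all(freq[c] > freq[d] for d in freq if not compatible(c, d))
    }

# Three small example trees on the leaf labels 1..4 (illustrative data only).
trees = [((1, 2), (3, 4)), ((1, 2), 3, 4), ((1, 3), (2, 4))]
result = frequent_clusters(trees)
```

In this example the cluster {1, 2} occurs twice and beats its incompatible competitors {1, 3} and {2, 4}, while {3, 4} only ties with them and is therefore not frequent.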
The frequency difference consensus tree was initially proposed by Goloboff et al. [10], and its relationship with other consensus trees was studied in [7]. In particular, it can be seen as a refinement of the majority-rule consensus tree [17, 13]. Moreover, it is known to give less noisy branches than the greedy consensus tree defined below. Steel and Velasco [30] concluded that “the frequency difference method is worthy of more widespread usage and serious study”. A naive construction of the frequency difference consensus tree takes time. The free software TNT [10] has implemented a heuristics method to construct it more efficiently. However, its time complexity remains unknown.
Recently, Jansson et al. [11] presented an $\tilde{O}(\min\{kn^2, k^2n\})$ time construction (implemented in the FACT software package [12]). Their algorithm first computes the frequency of every cluster with non-zero frequency. This is done in total $O(\min\{kn^2, k^2n\})$ time. They then show that given these computed frequencies, the frequency difference consensus tree can be computed in additional $O(kn\log^2 n)$ time. In Section 2 we show how to compute all frequencies in total $O(kn\log^2 n)$ time leading to the following theorem:
Theorem 1.
The frequency difference consensus tree of $k$ phylogenetic trees $T_1, T_2, \ldots, T_k$ on the same set of $n$ leaves can be computed in $O(kn\log^2 n)$ time.
To prove the above theorem, we first develop an $O(kn\log^2 n)$ time algorithm for assigning a number $id(u)$ to every node $u \in T_i$ such that $id(u) = id(u')$ iff $L(u) = L(u')$. With these numbers in hand, we can then compute the frequencies of all clusters in $O(kn)$ time using counting sort (since there are only $O(kn)$ clusters with non-zero frequencies, and each was assigned an integer bounded by $O(kn)$). Notice that this also generates a sorted list of all clusters with non-zero frequencies.
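Once identifiers are available, the counting step is simple. The following Python sketch assumes that identifiers are integers in the range `0..max_id` and that the sorted list mentioned above is ordered by frequency; the function name and the input layout (one list of identifiers per tree) are hypothetical.

```python
def frequencies_sorted(ids_per_tree, max_id):
    """Count how many trees contain each identifier, then bucket by frequency."""
    k = len(ids_per_tree)
    freq = [0] * (max_id + 1)
    for ids in ids_per_tree:
        # Inner nodes have at least two children, so distinct nodes of one
        # tree have distinct clusters and each identifier appears once per tree.
        for i in ids:
            freq[i] += 1
    # Counting sort: frequencies are integers between 1 and k.
    buckets = [[] for _ in range(k + 1)]
    for i, f in enumerate(freq):
        if f > 0:
            buckets[f].append(i)
    order = [i for f in range(k, 0, -1) for i in buckets[f]]
    return freq, order

# Illustrative identifiers for three trees.
freq, order = frequencies_sorted([[1, 2, 3], [1, 2, 4], [1, 5, 4]], max_id=5)
```

Both passes are linear in the total number of nodes plus the identifier range, which is why small identifiers matter for the stated bound.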
The greedy consensus tree.
We say that a given collection $\mathcal{C}$ of subsets of $[n]$ is consistent if there exists a phylogenetic tree $T$ such that the signature of $T$ is exactly $\mathcal{C}$. The greedy consensus tree is defined by the following procedure: We begin with an initially empty $\mathcal{C}$ and then consider all clusters $C$ in decreasing order of their frequencies. In this order, for every $C$, we check if $\mathcal{C} \cup \{C\}$ is consistent, and if so we add $C$ to $\mathcal{C}$.
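The greedy procedure can be sketched naively as follows. This Python illustration is not the fast algorithm developed in this paper. It assumes nested-tuple trees with integer leaf labels, tests consistency by pairwise compatibility (which suffices here because all singletons and the full leaf set have the maximum frequency and are therefore accepted first), and breaks frequency ties by sorted label lists, a choice the source does not specify. All function names are hypothetical.

```python
from collections import Counter

def clusters(tree):
    """Return all clusters of a tree given as nested tuples (an int is a leaf label)."""
    found = set()

    def visit(node):
        if isinstance(node, int):
            cluster = frozenset([node])
        else:
            cluster = frozenset().union(*(visit(child) for child in node))
        found.add(cluster)
        return cluster

    visit(tree)
    return found

def greedy_consensus_clusters(trees):
    """Scan clusters by decreasing frequency, keeping those compatible with all kept so far."""
    freq = Counter()
    for tree in trees:
        freq.update(clusters(tree))
    # Tie-breaking by sorted labels is an assumption made for determinism.
    order = sorted(freq, key=lambda c: (-freq[c], sorted(c)))
    chosen = []
    for c in order:
        if all(c.isdisjoint(d) or c <= d or d <= c for d in chosen):
            chosen.append(c)
    return set(chosen)

# Illustrative input: three trees on the leaf labels 1..4.
trees = [((1, 2), (3, 4)), ((1, 2), 3, 4), ((1, 3), (2, 4))]
result = greedy_consensus_clusters(trees)
```

On this input the greedy procedure keeps {3, 4} even though it only ties in frequency with clusters it conflicts with, because those conflicting clusters were already rejected against {1, 2}.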
The greedy consensus tree is one of the most well-known consensus trees. It has been used in numerous papers such as [6, 23, 14, 18, 3, 24, 29, 19, 2, 15, 16, 26, 33] to name a few. For example, in a recent landmark paper in Nature [23], it was used to construct the species tree from 1000 gene trees of yeast genomes, and in [6] it was asserted that “The greedy consensus tree offers some robustness to gene-tree discordance that may cause other methods to fail to recover the species tree. In addition, the greedy consensus method outperformed our other methods for branch lengths outside the too-greedy zone.”.
The greedy consensus tree is an extension of the majority-rule consensus tree, and is sometimes called the extended majority-rule consensus (eMRC) tree. It is implemented in popular phylogenetics software packages like PHYLIP [9], PAUP* [32], MrBayes [22], and RAxML [28]. A naive construction of the greedy consensus tree requires time [5]. To speed this up, these software packages use some forms of randomization methods. For example, PHYLIP uses hashing to improve the running time. Even with randomization, the time complexities of these solutions are not known. Recently, Jansson et al. [13] gave the best known provable construction with an deterministic running time (their implementation is also part of the FACT package). In Section 3 we present our main contribution, a deterministic construction as stated by the following theorem:
Theorem 2.
The greedy consensus tree of $k$ phylogenetic trees $T_1, T_2, \ldots, T_k$ on the same set of $n$ leaves can be computed in $\tilde{O}(kn^{1.5})$ time.
To prove the above theorem, we develop a generic procedure that takes any ordered list of clusters and tries adding them one-by-one to the current solution $\mathcal{C}$. We assume that every cluster $C$ is specified by providing a tree $T_i$ and a node $u \in T_i$ such that $C = L(u)$. Our procedure requires $\tilde{O}(\sqrt{n})$ time per cluster (to add this cluster to $\mathcal{C}$ or assert that it cannot be added) and does not need to assume anything about the order of the clusters. In particular, it does not rely on the clusters being sorted by frequencies.
2 Computing the Identifiers
We process the nodes of every $T_i$ in the bottom-up order. For every node $u \in T_i$, we compute the identifier $id(u)$ by updating the following structure called the dynamic set equality structure:
Lemma 1.
There exists a dynamic set equality structure that supports: (1) create a new empty structure in constant time, (2) add $x \in [n]$ to the current set in $O(\log^2 n)$ time, (3) return the identifier of the current set in constant time, and (4) list all elements of the current set $S$ in $O(|S|)$ time. The structure ensures that the identifiers are bounded by the total number of update operations performed so far, and that two sets are equal iff their identifiers are equal.
Proof.
To allow for listing all elements of the current set $S$, we store them in a list. Before adding the new element $x$ to the list, we need to check whether $x$ already belongs to $S$. This will be done using the representation described below.
Conceptually, we work with a complete binary tree on $n$ leaves labelled with $1, 2, \ldots, n$ when read from left to right (without loss of generality, $n$ is a power of 2), where every node corresponds to the set of labels of the leaves in its subtree. Now, any set $S \subseteq [n]$ is associated with a copy of this binary tree, where we write 1 in a leaf if the corresponding element belongs to $S$ and 0 otherwise. Then, for every node we define its characteristic vector by writing down the values written in the leaves of its subtree in the natural order (from left to right). Clearly, the vector of an inner node is obtained by concatenating the vectors of its children. We want to maintain identifiers of all nodes, so that the identifiers of two nodes are equal iff their characteristic vectors are identical. If we can keep the identifiers small, then the identifier of the current set $S$ can be computed as the identifier of the root of its tree.
Assume that we have already computed the identifiers of all nodes in the tree associated with $S$ and now want to add $x$ to $S$. This changes the value in the leaf corresponding to $x$ and, consequently, the characteristic vectors of all ancestors of this leaf. However, it does not change the characteristic vectors of any other node. Therefore, we traverse the ancestors of the leaf, starting from the leaf itself, and recompute their identifiers. Let $u$ be the current node. If we have never seen the characteristic vector of $u$ before, we can set the identifier of $u$ to be the largest already used identifier plus one. Otherwise, we have to set the identifier of $u$ to be the same as the one previously used for a node with such a characteristic vector. As mentioned above, the characteristic vector of an inner node $u$ is the concatenation of the characteristic vectors of its children $u_\ell$ and $u_r$. We maintain a dictionary mapping a pair consisting of the identifier of $u_\ell$ and the identifier of $u_r$ to the identifier of $u$. The dictionary is global, that is, shared by all instances of the structure. Then, assuming that we have already computed the up-to-date identifiers of $u_\ell$ and $u_r$, we only need to query the dictionary to check if the identifier of $u$ should be set to the largest already used identifier plus one (which is exactly when the dictionary does not contain the corresponding pair) or retrieve the appropriate identifier. Therefore, adding $x$ to $S$ reduces to $O(\log n)$ queries to the dictionary. By implementing the dictionary with balanced search trees, we therefore obtain the claimed $O(\log^2 n)$ time for adding an element.
We are not completely done yet, because creating a new complete binary tree takes $O(n)$ time and therefore the initialization time is not constant yet. However, we can observe that it does not make sense to explicitly maintain a node $u$ whose subtree contains no leaf corresponding to an element of $S$, because we can assume that the identifier of such a $u$ is 0. In other words, we can maintain only the part of the tree induced by the leaves corresponding to $S$. Adding an element $x$ is implemented as above, except that we might need to create (at most $\log n$) new nodes on the leaf-to-root path corresponding to $x$ (if such a leaf already exists, we terminate the procedure as $x \in S$ already) and then recompute the identifiers on the whole path as described above. ∎
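The construction in the proof can be sketched in Python as follows. This is a minimal sketch under stated assumptions: it swaps the balanced search tree for a hash-based `dict` (so the dictionary step is expected constant time rather than logarithmic worst case), uses the 0-based universe `{0, ..., n-1}`, and reserves identifier 0 for an empty subtree and 1 for a present leaf. The class name `SetEquality` and its members are hypothetical.

```python
class SetEquality:
    """Sparse complete binary tree over {0, ..., n-1} with canonical identifiers."""

    _pair_ids = {}  # global: (left id, right id) -> id, shared by all instances
    _next_id = 2    # 0 = empty subtree, 1 = present leaf

    def __init__(self, n):
        self.height = max(1, (n - 1).bit_length())  # tree has 2**height leaves
        self.ids = {}       # (height above leaves, index) -> id; missing means 0
        self.elements = []  # supports listing the current set

    def add(self, x):
        if (0, x) in self.ids:  # x is already in the set
            return
        self.elements.append(x)
        self.ids[(0, x)] = 1
        idx = x
        # Recompute identifiers on the leaf-to-root path only.
        for h in range(1, self.height + 1):
            idx //= 2
            left = self.ids.get((h - 1, 2 * idx), 0)
            right = self.ids.get((h - 1, 2 * idx + 1), 0)
            key = (left, right)
            if key not in SetEquality._pair_ids:
                SetEquality._pair_ids[key] = SetEquality._next_id
                SetEquality._next_id += 1
            self.ids[(h, idx)] = SetEquality._pair_ids[key]

    def identifier(self):
        return self.ids.get((self.height, 0), 0)

# Two structures receiving the same elements in different orders.
a, b, c = SetEquality(8), SetEquality(8), SetEquality(8)
for x in (3, 5):
    a.add(x)
for x in (5, 3):
    b.add(x)
c.add(3)
```

Identifiers are only compared between nodes at the same height, so reusing one global dictionary for all heights does not cause false matches at the root.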
Armed with Lemma 1, we process every $T_i$ bottom-up. Consider an inner node $u \in T_i$ and let $u_1, u_2, \ldots, u_d$ be its children ordered so that $|L(u_1)| \ge |L(u_j)|$ for every $j$, that is, the subtree rooted at $u_1$ is the largest. Assuming that we have already stored every $L(u_j)$ in a dynamic set equality structure, we construct a dynamic set equality structure storing $L(u)$ by simply inserting all elements of $L(u_2) \cup \cdots \cup L(u_d)$ into the structure of $u_1$. This takes $O(\log^2 n)$ time per element. Then, we set $id(u)$ to be the identifier of the obtained structure. By a standard argument (heavy path decomposition), every leaf of $T_i$ is inserted into at most $\log n$ structures and therefore the whole $T_i$ is processed in $O(n\log^3 n)$ time. This gives us an $O(nk\log^3 n)$ total time.
We now proceed with a faster $O(nk\log^2 n)$ total time solution. While this is irrelevant for our $\tilde{O}(kn^{1.5})$ time construction of the greedy consensus tree, it implies a better complexity for constructing the frequency difference consensus tree.
We start with a high-level intuition. Lemma 1 is, in a sense, more than we need, as it is not completely clear that we need to immediately compute the identifier of the current set. Indeed, applying heavy path decomposition we can partially delay computing the identifiers by proceeding in phases. In each phase, we can then replace the dynamic dictionary used to store the mapping with a radix sort. Intuitively, this shaves one log from the time complexity. We proceed with a detailed explanation.
Theorem 3.
The numbers $id(u)$ can be found for all nodes of the $k$ phylogenetic trees in $O(nk\log^2 n)$ total time.
Proof.
For a node $u \in T_i$, define its level to be the integer $\ell$ such that $2^\ell \le |L(u)| < 2^{\ell+1}$. Thus, the levels are between $0$ and $\log n$, the level of a node is at least as large as the levels of its children, and a node on level $\ell$ has at most one child on the same level. We work in phases $\ell = 0, 1, \ldots, \log n$. In phase $\ell$, we assume that the numbers $id(u)$ are already known for all nodes $u$ whose level is smaller than $\ell$, and want to assign these numbers to all nodes $u$ on level $\ell$. We will show how to achieve this in $O(nk\log n)$ time, thus proving the theorem.
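As a small illustration of the level structure, the following Python sketch assumes that the level of a node is the floor of the base-2 logarithm of its number of leaves, which is consistent with the properties stated above. Trees are nested tuples with integer leaf labels, and both function names are hypothetical.

```python
def annotate_levels(tree):
    """Return (leaf count, level, annotated children) for a nested-tuple tree."""
    if isinstance(tree, int):
        return (1, 0, [])
    children = [annotate_levels(child) for child in tree]
    size = sum(child[0] for child in children)
    # For size >= 1, bit_length() - 1 equals floor(log2(size)).
    return (size, size.bit_length() - 1, children)

def max_same_level_children(node):
    """Largest number of children sharing their parent's level, over all nodes."""
    _, level, children = node
    same = sum(1 for child in children if child[1] == level)
    return max([same] + [max_same_level_children(child) for child in children])

# Illustrative tree on 8 leaves.
root = annotate_levels((((1, 2), 3), ((4, 5), (6, 7, 8))))
worst = max_same_level_children(root)
```

Two children on the same level as their parent would together contribute at least twice the level's lower bound on leaves, pushing the parent to a higher level, which is why `worst` never exceeds one.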
Consider all nodes , such that . Because every such has at most one child at the same level, all level- nodes in can be partitioned into maximal paths of the form , where the level of the parent of is larger than (or is the root of ), and the levels of all children of (except for , if defined) are smaller than . is called the head of and denoted . Now, our goal is to find with the required properties for every . We will actually achieve a bit more. The sets are disjoint in every tree , and thus we can define, for every , a partition of the set of leaves , where every corresponds to a level- path in , such that . The elements of are then ordered, and we think that is a sequence of length . The ordering is chosen so that, for every , the set corresponds to some prefix of . denotes the prefix of of length . We will assign identifiers to all such prefixes , for every , and , with the property that the identifiers of two prefixes are equal iff the sets of leaves appearing in both of them are equal. Then, we can extract the required in constant time each by taking the identifiers of some .
Recall that in the slower solution we worked with a complete binary tree on leaves. For every set in the collection and every , we computed an identifier of the set . This was possible, because if and are the left and the right child of , respectively, then the identifier of can be found using the identifiers of and . We need to show that retrieving these identifiers can be batched.
Fix a node and, for every and , consider all prefixes for . We create a version of for every such prefix. The version corresponds to the set containing all elements of occurring in the prefix . We want to assign identifiers to all versions of . First, observe that we only have to create a new version if , as otherwise the set is the same as for . Thus, the total number of required versions, when summed over all nodes on the same depth in , is only , as a leaf of creates exactly new version for some . For every node , we will store a list of all its versions. A version consists of its identifier (such that the identifier of two versions is the same iff the corresponding sets are equal) together with the indices , and . We describe how to create such a list for every node at the same depth given the lists for all nodes at depth next.
Let and be the left and the right child of , respectively. Then, we need to create a new version of for every new version of and every new version of , because for the set corresponding to to change either the set corresponding to or the set corresponding to must change, and every change is adding one new element. Fix and and consider all versions of corresponding to and sorted according to . Let the sorted list of their ’s be . Similarly, consider all versions of corresponding to and sorted according to , and let the sorted list of their ’s be . For every , we create a new version of corresponding to , , and equal to . This is done by retrieving the version of with equal to , such that and is maximized, and the version of with equal to , such that and is maximized. Then, the identifier of the new version of can be constructed from the pair consisting of the identifiers of these versions of and (this is essentially the same reasoning as in the slower solution). We could now use a dictionary to map these pairs to identifiers. However, we can also observe that, in fact, we have reduced finding the identifiers of all versions of all nodes at the same depth to identifying duplicates on a list of pairs of numbers from . This can be done by radix sorting all pairs in linear time (more precisely, time and space), and then sweeping through the sorted list while assigning the identifiers. This takes only time for every depth , so for every level as claimed. ∎
The proof of Theorem 1 follows immediately from Theorem 3.
3 Simulating the Greedy Algorithm
We consider $k$ trees $T_1, T_2, \ldots, T_k$ on the same set of leaves $[n]$, and assume that every node $u \in T_i$ has an identifier $id(u)$ such that $id(u) = id(u')$ iff $L(u) = L(u')$. We next develop a general method for maintaining a solution $\mathcal{C}$ (i.e., a set of compatible identifiers) so that, given any node $u \in T_i$, we are able to efficiently check if $L(u)$ is compatible with $\mathcal{C}$, meaning that $\mathcal{C} \cup \{L(u)\}$ is consistent, and if so add $L(u)$ to $\mathcal{C}$. Our method does not rely on the order in which the sets arrive and in particular can be used to run the greedy algorithm.
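We represent $\mathcal{C}$ with a phylogenetic tree whose signature is exactly $\mathcal{C}$; this tree is called the current consensus tree. By Lemma 2.2 of [13], $L(u)$ is compatible with $\mathcal{C}$ iff there exists a node $v$ of the current consensus tree with $L(u) \subseteq L(v)$ such that for every child $v'$ of $v$ either $L(v') \subseteq L(u)$ or $L(v') \cap L(u) = \emptyset$. Also, adding $L(u)$ to $\mathcal{C}$ can be done by creating a new child $v''$ of $v$ and reconnecting every original child $v'$ of $v$ such that $L(v') \subseteq L(u)$ to the new $v''$. This is illustrated in Figure 1.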
Initially, the current consensus tree consists only of $n$ leaves attached to the common root (which corresponds to $[n]$). Our goal is to maintain some additional information so that, given any node $u \in T_i$, we can check if $L(u)$ is compatible with $\mathcal{C}$ in $\tilde{O}(\sqrt{n})$ time. After adding $L(u)$ to $\mathcal{C}$, the information will be updated in amortized $\tilde{O}(\sqrt{n})$ time. To explain the intuition, we first show how to check if $L(u)$ is compatible with $\mathcal{C}$ in roughly $O(|L(u)|)$ time.
Let $L(u) = \{x_1, x_2, \ldots, x_s\}$ and let $y_j$ be the leaf of the current consensus tree labelled with $x_j$. Then, the node $v$ witnessing compatibility must be an ancestor of every $y_j$. We claim that, in fact, $v$ should be chosen as the lowest common ancestor of $y_1, y_2, \ldots, y_s$, because if all $y_j$'s are in the same subtree rooted at a child $v'$ of $v$ then we can as well replace $v$ with $v'$. So, we can find $v$ by asking $O(|L(u)|)$ lca queries: we start with $y_1$ and then iteratively jump to the lca of the current node and $y_j$. Assuming that we represent the current consensus tree in such a way that an lca query can be answered efficiently, this takes roughly $O(|L(u)|)$ time. Then, we need to decide if for every child $v'$ of $v$ it holds that $L(v') \subseteq L(u)$ or $L(v') \cap L(u) = \emptyset$. This can be done by computing, for every such $v'$, how many $y_j$'s belong to the subtree rooted at $v'$, and then checking if this number is either 0 or $|L(v')|$. To compute these numbers, we maintain a counter for every $v'$. Then, for every $j$ we retrieve the child $v'$ of $v$ such that $y_j$ belongs to the subtree rooted at $v'$ and increase the counter of $v'$. Assuming that we represent the current consensus tree so that such $v'$ can be retrieved efficiently, this again takes roughly $O(|L(u)|)$ time. Finally, we iterate over all $j$ again, retrieve the corresponding $v'$ and check if its counter is equal to $|L(v')|$ (so our representation of the current consensus tree should also allow retrieving the number of leaves in a subtree). If not, then $L(u)$ is not compatible with $\mathcal{C}$, see Figure 2. Otherwise, we create the new node $v''$ and reconnect to $v''$ all children $v'$ of $v$ such that the counter of $v'$ is equal to $|L(v')|$.
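The direct check described above can be sketched in Python. This sketch reproduces only the logic (lowest common ancestor, per-child counters, reconnection); it does not attempt the efficient representation, since the lca, child lookup, and leaf counting here are naive pointer walks. The names `Node`, `try_add`, and `star_tree` are hypothetical.

```python
class Node:
    def __init__(self, label=None):
        self.label, self.parent, self.children = label, None, []

def leaf_count(v):
    return 1 if not v.children else sum(leaf_count(c) for c in v.children)

def depth(v):
    d = 0
    while v.parent is not None:
        v, d = v.parent, d + 1
    return d

def lca(a, b):
    da, db = depth(a), depth(b)
    while da > db:
        a, da = a.parent, da - 1
    while db > da:
        b, db = b.parent, db - 1
    while a is not b:
        a, b = a.parent, b.parent
    return a

def child_towards(v, leaf):
    """Child of v whose subtree contains leaf (a proper descendant of v)."""
    while leaf.parent is not v:
        leaf = leaf.parent
    return leaf

def try_add(leaf_of, cluster):
    """Insert the cluster into the tree if compatible; return whether it is compatible."""
    leaves = [leaf_of[x] for x in cluster]
    if len(leaves) < 2:
        return True  # singletons are always present
    v = leaves[0]
    for leaf in leaves[1:]:
        v = lca(v, leaf)
    counter = {}
    for leaf in leaves:
        c = child_towards(v, leaf)
        counter[c] = counter.get(c, 0) + 1
    # Every touched child must lie entirely inside the cluster.
    if any(counter[c] != leaf_count(c) for c in counter):
        return False
    if len(counter) == len(v.children):
        return True  # the cluster equals L(v), nothing to change
    new = Node()
    new.parent = v
    v.children = [c for c in v.children if c not in counter] + [new]
    for c in counter:
        c.parent = new
        new.children.append(c)
    return True

def star_tree(n):
    """Initial consensus tree: n leaves attached to a common root."""
    root, leaf_of = Node(), {}
    for x in range(1, n + 1):
        leaf = Node(x)
        leaf.parent = root
        root.children.append(leaf)
        leaf_of[x] = leaf
    return root, leaf_of

root, leaf_of = star_tree(4)
added_12 = try_add(leaf_of, {1, 2})
added_13 = try_add(leaf_of, {1, 3})
added_34 = try_add(leaf_of, {3, 4})
```

Here {1, 3} is rejected because the child holding leaves 1 and 2 is only partially covered, while {1, 2} and {3, 4} each create a new inner node under the root.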
We would like to avoid explicitly iterating over all elements of $L(u)$. This will be done by maintaining some additional information, so that we only have to iterate over up to $\sqrt{n}$ elements. To explain what the additional information is, we need the (standard) notion of a micro-macro decomposition. Let $b$ be a parameter and consider a binary tree on $n$ nodes. We want to partition it into $O(n/b)$ node-disjoint subtrees called micro trees. Each micro tree is of size at most $b$ and contains at most two boundary nodes that are adjacent to nodes in other micro trees. One of these boundary nodes, called the top boundary node, is the root of the whole micro tree, and the other is called the bottom boundary node. Such a partition is always possible and can be found in $O(n)$ time.
We binarize every $T_i$ to obtain $T'_i$. Then, we find a micro-macro decomposition of $T'_i$ with $b = \sqrt{n}$. By properties of the decomposition we have the following:
Proposition 1.
For any such that , there exists a boundary node such that can be obtained by adding at most elements to . Furthermore, and these up to elements can be retrieved in time after preprocessing.
The total number of boundary nodes is only . For each such boundary node , we maintain a pointer to a node called the finger of . is a node such that but, for every child of , .
Proposition 2.
The node is the lowest common ancestor in of all leaves with labels belonging to .
Additionally, the children of are partitioned into three groups: (1) such that , (2) such that , and (3) the rest. We call them full, empty, and mixed, respectively (with respect to ). For each group we maintain a list storing all nodes in the group, every node knows its group, and the group knows it size. Additionally, every group knows the total number of leaves in all subtrees rooted at its nodes.
We also need to augment the representation to allow for efficient extended lca queries. The lowest common ancestor (lca) of and is the leafmost node that is an ancestor of both and . An extended lca query, denoted , returns the first edge on the path from the lca of and to , and -1 if is an ancestor of . For example, in Figure 2, whereas is the edge between and its leftmost child.
Lemma 2.
We can maintain a collection of rooted trees under: (1) create a new tree consisting of a single node, (2) make the root of one tree a child of a node in another tree, (3) delete an edge from a node to its parent, (4) count leaves in the tree containing a given node, and (5) extended lca queries, all in amortized $O(\log N)$ time, where $N$ is the total size of all trees in the collection.
Proof.
We apply the link/cut trees of Sleator and Tarjan [25] to maintain the collection. This immediately gives us the first three operations. To implement computing the size and queries we need to explain the internals of link/cut trees. Each tree is partitioned into node-disjoint paths consisting of preferred edges. Each node has at most one such edge leading to its preferred child. For each maximal path consisting of preferred edges, called a preferred path, we store its nodes in a splay tree, where the left-to-right order on the nodes of the splay tree corresponds to the top-bottom order on the nodes in the rooted tree. Each such splay tree stores a pointer to the topmost node of its preferred path. Additionally, each node of the tree stores a pointer to its current parent. All operations on a link/cut tree use the access procedure. Its goal is to change the preferred edges so that there is a preferred path starting at the root and ending at . This is done by first shortening the preferred path containing so that it ends at . Then, we iteratively jump to the topmost node of the current preferred path and make the preferred child of its parent. Whenever the preferred child of a node changes, we need to update the splay tree representing the nodes of the preferred path. Even though the number of jumps might be , it can be shown that all these updates take amortized time.
Now we can explain how to implement . First, we access node . This gives us a preferred path starting at the root and ending at . Second, we access node while keeping track of the topmost nodes of the visited preferred paths. If is on the same preferred path as , then is an ancestor of . Otherwise, let be the preferred path visited just before reaching the preferred path starting at the root of the whole tree. Then the topmost node of (before changing the preferred child of its parent) should be returned as . Thus, the complexity of is the same as the complexity of access.
To compute the size of a tree, we augment the splay trees. Every node of a preferred path stores the total number of leaves in all subtrees attached to it through non-preferred edges (plus one if the node itself is a leaf). Additionally, every node of a splay tree stores the sum of the numbers stored in its subtree, or in other words the total number of leaves in all subtrees attached to its corresponding contiguous fragment of the preferred path through non-preferred edges. The sums stored at the nodes of the splay tree are easily maintained during rotations. We also need to update the total number of leaves after making a preferred edge non-preferred or vice versa. This is easily done by accessing the sum stored at the root of the splay tree. To obtain the number of leaves in the tree containing $u$, we need to access $u$. This makes all of $u$'s children non-preferred and makes $u$ the root of its splay tree. Hence, the sum stored at $u$ is the total number of leaves in the tree containing $u$. ∎
We next show how to efficiently check for any if is compatible with . By the following lemma, this can be done in time, assuming we have stored the required additional information. Recall that this additional information includes:
1. The current consensus tree, maintained using Lemma 2.
2. For every boundary node $v$, we store its finger.
3. For every boundary node $v$, we store three lists containing the full, the mixed, and the empty children of its finger, respectively. Each list also stores the total number of leaves in all subtrees rooted at its nodes.
Lemma 3.
Assuming access to the above additional information, given any node $u \in T_i$ we can check if $L(u)$ is compatible with $\mathcal{C}$ in $\tilde{O}(\sqrt{n})$ time.
Proof.
By Lemma 2.2 of [13], to check if is compatible with we need to check if there exists a node such that for every child of either or . First, observe that can be chosen as the lowest common ancestor of all leaves with labels belonging to . By properties of the micro-macro decomposition, we can retrieve a boundary node and a set of up to labels such that (if , there is no ). Then, the lowest common ancestor of all leaves with labels belonging to is the lowest common ancestor of and all leaves with labels belonging to . Therefore, can be found with lca queries in time. Second, to check if or for every child of we distinguish two cases:
If is a proper ancestor of we can calculate for every in time as follows. Every edge has its associated counter. We assume that all counters are set to zero before starting the procedure and will make sure that they are cleared at the end. First, we use an query to access the edge leading to the subtree containing and set its counter to . Then, we iterate over all , retrieve the leaf of labelled with , and use an query to access the edge leading to the subtree of containing and increase its counter by one. Additionally, whenever we access an edge for the first time (in this particular query), we add it to a temporary list . After having processed all , we iterate over and check if the counter of is equal to the number of leaves in the subtree rooted at (which requires retrieving the number of leaves). If this condition holds for every then is compatible with and furthermore, the nodes such that are exactly the ones that should be reconnected. Finally, we iterate over the edges in again and reset their counters.
If the situation is a bit more complicated because we might not have enough time to explicitly iterate over all children of that should be reconnected. Nevertheless, we can use a very similar method. Every edge has its associated counter (again, we assume that the counter are set to zero before starting the procedure and will make sure that they are cleared at the end). We also need a global counter , that is set to the total number of leaves in all subtrees rooted at either full or mixed children of decreased by . can be initialized in constant time in the first step of the procedure due to the additional information stored with every list of children. Intuitively, is how many leaves not belonging to we still have to see to conclude that indeed or for every child of . We iterate over and access the edge leading to the subtree containing labelled with . We decrease by one and, if is an empty child of and this is the first time we have seen (in this query) then we add the number of leaves in the subtree rooted at to . If, after having processed all , then we conclude that is compatible with . The whole process takes time. ∎
Before explaining the details of how to update the additional information, we present the intuition. Recall that adding to is done by creating a new child of and reconnecting some children of to . Let the set of all children of be and the set of children that should be reconnected be . Note that if or then we do not have to change anything in . Otherwise, updating can be implemented using two different methods:
1. Delete edges from nodes in to . Create a new tree consisting of a single node and make it a child of . Then, make all nodes in children of .
2. Delete edges from nodes in to . Delete the edge from to its parent . Create a new tree consisting of a single node and make it a child of . Then, make a child of and also make all nodes in children of . See Figure 4.
Thus, by using the first or the second method, the number of operations can be either or . We claim that by choosing the cheaper option we can guarantee that the total time for modifying the link-cut tree representation of is . Intuitively, every edge of the final consensus tree participates in operations, and there are at most such edges. This is formalized in the following lemma.
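The choice between the two methods can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: plain dictionaries `children` and `parent` stand in for the link-cut tree representation, and the function name `add_cluster` and its parameters are hypothetical. The sketch reconnects whichever side is smaller, so the number of reconnected nodes is the minimum of the two set sizes.

```python
def add_cluster(children, parent, u, B, new_node):
    """Split the children of u so that the nodes in B end up below one node.

    children: dict, node -> set of its children.
    parent:   dict, node -> its parent (None or absent for the root).
    B:        proper non-empty subset of children[u], passed as its own set.
    new_node: fresh node identifier, not yet present in the tree.
    Returns the number of children that were reconnected.
    """
    rest = children[u] - B  # children that are not in B
    if len(B) <= len(rest):
        # First method: new_node becomes a child of u and adopts B.
        parent[new_node] = u
        children[new_node] = set()
        children[u].add(new_node)
        moved = B
    else:
        # Second method: new_node is inserted between u and its parent
        # and adopts the children outside B; u keeps the nodes in B.
        p = parent.get(u)
        if p is not None:
            children[p].remove(u)
            children[p].add(new_node)
        parent[new_node] = p
        children[new_node] = {u}
        parent[u] = new_node
        moved = rest
    for w in moved:
        children[u].remove(w)
        children[new_node].add(w)
        parent[w] = new_node
    return len(moved)
```

Both branches produce the same tree shape. They differ only in which of the two nodes keeps the identity of `u`, which is what allows the cheaper side to be chosen.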
Lemma 4.
summed over all updates of is .
Proof.
We assume that in every update, as otherwise there is nothing to change in . Then, there are at most updates, as each of them creates a new inner node and there are never any nodes with degree 1 in .
We bound the sum of by assigning credits to inner nodes of . During the execution of the algorithm, a node with siblings should have credits. Whenever we create a new inner node we need at most new credits, so the total number of allocated credits is . It remains to argue that, whenever we create a new child of and reconnect some of its children, the original credits of can be used to pay for the update and make sure that all children of and have enough credits after the update.
Denoting and , the cost of the update is . The total number of credits of all children of before the update is . After the update, the number of credits of all children of is and the number of credits of all children of is . Ignoring the new credits allocated to , the number of available credits is thus:
[TABLE]
which is at least for , so enough to pay for the update. Hence, the sum is at most . ∎
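The effect of always paying for the smaller side can also be observed empirically. The following Python sketch is not the credit argument from the proof; it is a small simulation under stated assumptions. It starts from a single node with all leaves as children, repeatedly splits a node at a random position, and sums the smaller of the two side sizes. The function name and the random splitting rule are hypothetical, and the threshold of 2 n log2 n used in the check afterwards is a loose bound chosen for this sketch, not the constant from the lemma.

```python
import math
import random


def total_smaller_side_cost(n, seed=0):
    """Sum of min(moved, kept) over a random sequence of splits.

    Only the numbers of children are tracked. Splitting a node with c
    children moves x of them (2 <= x <= c - 1) under a new node, so the
    new node has x children and the old node has c - x + 1 children.
    """
    rng = random.Random(seed)
    # Child counts of nodes that can still be split (at least 3 children).
    splittable = [n] if n >= 3 else []
    total = 0
    while splittable:
        c = splittable.pop()
        x = rng.randint(2, c - 1)  # size of the reconnected set
        y = c - x                  # size of the set that stays
        total += min(x, y)
        for size in (x, y + 1):
            if size >= 3:
                splittable.append(size)
    return total


if __name__ == "__main__":
    for n in (10, 1000, 100000):
        print(n, total_smaller_side_cost(n), round(2 * n * math.log2(n)))
```

Both resulting child counts are strictly smaller than the count that was split, so the simulation always terminates.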
Before presenting the whole update procedure, we need one more technical lemma.
Lemma 5.
The procedure for checking if is compatible with can be requested to return in time or in time.
Proof.
By inspecting the proof of Lemma 3, we see that there are two cases depending on whether is a proper ancestor of or not.
1. If is a proper ancestor of then can be obtained from . More precisely, for every we add to in total time. We can also obtain in time.
2. If then, while iterating over , if this is the first time we have seen then we add to . Additionally, we add all full children of to . Thus, can be generated in time. Similarly, consists of all empty children of without the nodes seen when iterating over , and so can be generated in time.
Thus, we can always generate in time and in time. ∎
To add to , we will need to iterate over either or (depending on which is smaller). After paying additional time we can assume that we have access to a list of the elements in the appropriate set. The additional time sums up to , because there can be only distinct new sets added to .
Lemma 6.
If is compatible with then, after adding to and modifying we can update all additional information in amortized time assuming that we add such sets.
Proof.
Recall that is maintained using the data structure from Lemma 2, and adding to is implemented by creating a new child of and reconnecting some of the children of to . is the set of all children of and is the set of children of that are reconnected to . If we iterate over and reconnect them one-by-one. If we iterate over and reconnect them to a new node that is inserted between and its parent. To iterate over either or , we extend the query procedure as explained in Lemma 5. This adds to the time complexity, but then we can assume that the requested set can be generated in time proportional to its size. To unify the case of and , we think of as being replaced with two nodes and , where is the parent of . All nodes in become children of while all nodes of become children of after iterating over either or , depending on which set is smaller. By Lemma 4, in the whole process we iterate over sets of total size , which is only amortized assuming that we add sets .
Consider a boundary node . If then there is no need to update the additional information concerning . If then we need to decide if the finger of should be set to or and update the partition of the children of accordingly. should be set to exactly when, for any , or, in other words, all nodes in are empty with respect to . The groups should be updated as follows:
1. If is set to then we should remove all nodes in from the list of empty nodes with respect to (as they are no longer children of ). Other groups remain unchanged.
2. If is set to then we should remove all nodes in from the lists. Additionally, we need to insert into the appropriate group: full if all nodes in were full, empty if all nodes in were empty, and mixed otherwise.
We need to show that all these conditions can be checked by either iterating over the nodes of or over the nodes of , because we want to iterate over the smaller of these. This then guarantees that the amortized cost of updating the additional information for a boundary node is only , so amortized overall.
To check if all nodes in are empty with respect to , we can either iterate over the nodes in or iterate over all nodes in and check if all nodes in that are full or empty in fact belong to (this is possible because we also keep the total number of full and empty nodes in ). Thus, we can check if should be set to .
If is set to we need to decide where to put . We only explain how to decide if all nodes in are full, as the procedure for empty is symmetric. We can either iterate over all nodes in and check that they are full or iterate over all nodes in and check that all nodes in that are empty or mixed in fact belong to (and thus do not belong to , so all nodes in are full). Finally, we add the number of leaves in the subtree rooted at (extracted in time) to the appropriate sum.
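The idea of answering the same question from either side can be sketched in Python. This is a minimal illustration under stated assumptions: the `status` dictionary (with values "full", "empty" or "mixed") and the function name are hypothetical, and the number of children that are not full is assumed to be available from the maintained counters. The set of reconnected children is called `B` in the comments, and `A` denotes all children.

```python
def reconnected_all_full(side, side_is_B, status, not_full_total):
    """Decide whether every node in B is full by scanning only one side.

    side:           the smaller of B and A \\ B, as an iterable of nodes.
    side_is_B:      True if `side` is B, False if it is A \\ B.
    status:         dict, node -> "full", "empty" or "mixed".
    not_full_total: number of nodes in A that are empty or mixed.
    """
    if side_is_B:
        # Direct check: scan B itself.
        return all(status[w] == "full" for w in side)
    # Indirect check: B is entirely full exactly when every node that is
    # not full lies outside B, that is, in the scanned side A \ B.
    not_full_here = sum(1 for w in side if status[w] != "full")
    return not_full_here == not_full_total
```

The check for empty nodes is symmetric, with the roles of "full" and "empty" exchanged.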
It remains to describe how to remove all unnecessary nodes from the lists. Here we do not worry about having to iterate over the smaller set, because there are only new edges created during the whole execution of the algorithm, so we can afford to explicitly iterate over the nodes that should be removed, that is, over or . For every removed node, we also subtract the number of leaves in its subtree (extracted in time) from the appropriate sum. Overall, this adds per boundary node to the time complexity, so only amortized overall. ∎
References
- [1] E. N. Adams III. Consensus techniques and the comparison of taxonomic trees. Systematic Zoology, 21(4):390–397, 1972.
- [2] M. Bayzid, S. Mirarab, B. Boussau, and T. Warnow. Weighted statistical binning: enabling statistically consistent genome-scale phylogenetic analyses. PLOS One, page e0129183, 2015.
- [3] M. S. Bayzid and T. J. Warnow. Naive binning improves phylogenomic analyses. Bioinformatics, 29(18):2277–2284, 2013.
- [4] K. Bremer. Combinable component consensus. Cladistics, 6(4):369–372, 1990.
- [5] D. Bryant. A classification of consensus methods for phylogenetics. In Bioconsensus, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, volume 61, pages 163–184. 2003.
- [6] J. H. Degnan, M. DeGiorgio, D. Bryant, and N. A. Rosenberg. Properties of consensus methods for inferring species trees from gene trees. Systematic Biology, 58(1):35–54, 2009.
- [7] J. Dong, D. Fernández-Baca, F. R. McMorris, and R. C. Powers. Majority-rule (+) consensus trees. Mathematical Biosciences, 228(1):10–15, 2010.
- [8] J. Felsenstein. Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, Massachusetts, 2004.
