Distinguishing Phylogenetic Networks
Elizabeth Gross, Colby Long

Abstract
Phylogenetic networks are becoming increasingly popular in phylogenetics since they have the ability to describe a wider range of evolutionary events than their tree counterparts. In this paper, we study Markov models on phylogenetic networks and their associated geometry. We restrict our attention to large-cycle networks, networks with a single undirected cycle of length at least four. Using tools from computational algebraic geometry, we show that the semi-directed network topology is generically identifiable for Jukes-Cantor large-cycle network models.
Department of Mathematics and Statistics, One Washington Square, San José State University, San José, CA, 95192-0103, USA
Mathematical Biosciences Institute, The Ohio State University, 1735 Neil Ave., Columbus OH, 43210, USA
1. Introduction
There are many reasons why a single phylogenetic tree may fail to fully describe the evolutionary history of a group of taxa. Issues such as hybridization, horizontal gene transfer, and incomplete lineage sorting are known to cause discordance between gene trees [21, 25, 33]. To account for hybridization and horizontal gene transfer, evolutionary phylogenetic networks have recently come to the foreground of phylogenetics. These networks model the process of evolution along a directed acyclic graph where certain edges in the network represent reticulation events. Since the network topology is meant to reflect the actual history of a group of taxa, the topologies are often constrained to a class of networks considered biologically reasonable. Because of their increasing importance, many concepts from modeling and inference on phylogenetic trees are now being applied to phylogenetic networks. For example, a number of authors have incorporated the coalescent process into network models to account for incomplete lineage sorting [18, 29, 35]. There have also been a number of papers exploring network inference [10, 16, 17] and the combinatorial properties of different classes of phylogenetic networks [8, 13, 28]. In fact, the body of work in this area has grown to the point that there are now several surveys on the topic (e.g., [22, 14, 23]). Despite this, the present paper is one of the first to analyze the algebraic and geometric properties of phylogenetic networks.
The implicit goal underlying much of this work is to eventually be able to infer the phylogenetic network that explains the evolutionary history of a group of taxa from biological data. Thus, a fundamental question about any phylogenetic network model is the identifiability of the underlying network, that is, whether or not the network topology can be uniquely identified from data generated by the network. Some positive results in this direction have been proven. For example, in [29] it is shown that there are $n$-leaf networks that can be uniquely identified from the quartet topology distribution induced by the coalescent process. However, there are other results that should give pause to those attempting to reconstruct networks. For example, it is shown in [22] that two distinct networks can share the same set of subtrees. Likewise, in [26] the authors show that two topologically distinct weighted networks can share the same set of weighted subtrees.
In this paper, we will consider Jukes-Cantor network models where the process of DNA sequence evolution is modeled as a Markov process proceeding along an $n$-leaf directed acyclic graph (DAG). We are particularly interested in the identifiability (or lack thereof) of the network topology from the distribution on $n$-tuples of DNA bases generated by the network. This is distinct from the notion of identifiability discussed in [26], as we do not assume any knowledge about which sites were produced by the same subtree of the network. The two-state Cavender-Farris-Neyman model may seem the more natural starting point for our exploration of network identifiability. However, as is evident from our computations in Proposition 4.7, the restricted coordinate space for this model makes it impossible to distinguish small networks from one another, which is our main strategy for eventually proving identifiability in the Jukes-Cantor case.
Since the Jukes-Cantor model is time-reversible, the precise location of the root within the network will be unidentifiable from the distribution. However, we cannot simply study the unrooted topology of networks without orientation, since reticulation edges, edges directed into vertices of indegree two, play a special role in defining the distribution. Thus, our results concern the identifiability of the semi-directed network topology, the unrooted, undirected network with distinguished reticulation edges. We will also restrict our attention to networks with only a single reticulation vertex, which we call cycle-networks. We will refer to the set of all cycle-networks with cycle length at least 4 as the class of large-cycle networks. The main result of this paper is the following theorem.
Theorem 1.1.
The semi-directed network topology parameter of large-cycle Jukes-Cantor network models is generically identifiable.
Markov models on networks with a single reticulation vertex are very closely related to 2-tree mixture models but with some subtle differences that we discuss in Section 2.1. Using techniques from algebraic statistics, it is shown in [2] that the tree parameters of a 2-tree Jukes-Cantor mixture are generically identifiable. Here we adopt a similar approach. We associate to each network $\mathcal{N}$ an algebraic variety $V_{\mathcal{N}}$ that is the Zariski closure of the set of probability distributions attained by varying the numerical parameters in the model on $\mathcal{N}$. We then study the associated ideals of the networks to find algebraic invariants that distinguish networks from one another. The two networks in Figure 1 demonstrate that the generic identifiability results for 2-tree mixtures do not apply to phylogenetic networks. These networks have different semi-directed network topologies and induce different multisets of embedded trees. Surprisingly, however, the algebraic variety for the network on the left is properly contained in that of the network on the right. This example highlights another contrast between cycle-networks and 2-tree mixtures, as the varieties for distinct $n$-leaf cycle-networks need not even have the same dimension.
This paper is organized as follows. In Section 2, we introduce the appropriate network terminology and rigorously define Jukes-Cantor network models to show how to obtain a probability distribution on DNA site patterns from a network. In Section 3, we introduce the concept of generic identifiability and the algebraic background necessary to prove the main results. Finally, in Section 4, we present the main results about the identifiability of the semi-directed network topology. As Figure 1 illustrates, the network topologies are not generically identifiable, but we will be able to restrict to a class of networks that preserves identifiability. Additionally, we will be able to show many specific instances where identifiability fails. In Section 5, we conclude with a discussion about the consequences of these results for inferring phylogenies.
2. Phylogenetic Networks
The following network notation and terminology is adapted from [8, 9, 28].
Definition 2.1.
A phylogenetic network $\mathcal{N}$ on $X$ is a rooted acyclic digraph with no edges in parallel and satisfying the following properties:
- (i) the root has out-degree two;
- (ii) a vertex with out-degree zero has in-degree one, and the set of vertices with out-degree zero is $X$;
- (iii) all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one.
Note that these are sometimes also referred to as binary phylogenetic networks. A vertex with indegree one and outdegree two is called a tree vertex and a vertex with indegree two and outdegree one is called a reticulation vertex or simply a reticulation. Edges directed into a reticulation vertex are called reticulation edges and all other edges are called tree edges. Recall that the reason for introducing phylogenetic networks is to model possible hybridization events and horizontal gene transfer. These events of course can only occur when two species coexist in time. Considering only directed acyclic graphs precludes any paradoxes wherein genetic information is transported back in time. However, as noted in the introduction, due to the time-reversibility of the Jukes-Cantor model, we will not actually be able to identify the location of the root in the network from the models we define. Therefore, in this paper we are primarily interested in recovering the underlying semi-directed network topology of a phylogenetic network. The semi-directed network topology is obtained from a phylogenetic network by suppressing the root node and undirecting all tree edges while the reticulation edges remain directed.
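The degree conditions of Definition 2.1 and the vertex and edge terminology above can be checked mechanically. The following Python sketch is only an illustration: the function name `classify_network`, the edge-list encoding, and the small example network are assumptions made here and are not taken from the paper. It does not test acyclicity.

```python
from collections import Counter

def classify_network(edges, leaf_labels):
    """Check the degree conditions (i)-(iii) of a binary phylogenetic network.

    `edges` is a list of (parent, child) pairs. Returns the tree vertices,
    the reticulation vertices, and the reticulation edges.
    Acyclicity is NOT checked in this sketch.
    """
    if len(set(edges)) != len(edges):
        raise ValueError("parallel edges are not allowed")
    indeg = Counter(child for _, child in edges)
    outdeg = Counter(parent for parent, _ in edges)
    vertices = {v for edge in edges for v in edge}

    # (i) a single root, of out-degree two
    roots = {v for v in vertices if indeg[v] == 0}
    if len(roots) != 1 or outdeg[next(iter(roots))] != 2:
        raise ValueError("condition (i) fails")

    # (ii) out-degree-zero vertices are exactly the labeled leaves, in-degree one
    leaves = {v for v in vertices if outdeg[v] == 0}
    if leaves != set(leaf_labels) or any(indeg[v] != 1 for v in leaves):
        raise ValueError("condition (ii) fails")

    # (iii) every other vertex is a tree vertex or a reticulation vertex
    tree_vertices, reticulations = set(), set()
    for v in vertices - roots - leaves:
        degrees = (indeg[v], outdeg[v])
        if degrees == (1, 2):
            tree_vertices.add(v)
        elif degrees == (2, 1):
            reticulations.add(v)
        else:
            raise ValueError(f"condition (iii) fails at vertex {v!r}")

    # reticulation edges are the edges directed into a reticulation vertex
    reticulation_edges = [e for e in edges if e[1] in reticulations]
    return tree_vertices, reticulations, reticulation_edges

# Hypothetical 3-leaf network with a single reticulation vertex "h".
edges = [("r", "u"), ("r", "w"), ("u", "1"), ("u", "h"),
         ("w", "h"), ("w", "3"), ("h", "2")]
tree_vertices, reticulations, reticulation_edges = classify_network(
    edges, {"1", "2", "3"}
)
```

In this example the two edges directed into `"h"` are the only edges that would stay directed in the semi-directed network topology.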
The class of phylogenetic networks is quite large and the algebraic approach we adopt becomes increasingly complicated as the number of reticulations in the network increases. Therefore, we begin here by studying networks that contain at most one reticulation vertex. Such structures are necessarily level-1 networks [5], networks in which every edge belongs to at most one cycle. In fact, networks with exactly one reticulation vertex contain a single undirected cycle, which motivates the following definition.
Definition 2.2.
A cycle-network is a semi-directed network with one reticulation vertex. A $k$-cycle network is a cycle-network with cycle size $k$.
Note that the cycle of a cycle-network always contains the reticulation vertex and the two reticulation edges. We will refer to an internal vertex contained in the cycle of a cycle-network as a cycle vertex. For subsequent sections, it will be helpful to establish some conventions for $n$-leaf $k$-cycle networks. We can view an $n$-leaf $k$-cycle network $\mathcal{N}$ as a $k$-cycle with a tree affixed to each cycle vertex. Let $X$ be the leaf label set of $\mathcal{N}$. Then the cycle vertices of $\mathcal{N}$ induce an ordered partition of $X$. Label the reticulation vertex by $v_1$ and label the remaining cycle vertices in a clockwise fashion as $v_2, \ldots, v_k$. Using the shorthand $A_i$ for the set of leaves of the tree affixed to $v_i$, $\mathcal{N}$ induces the ordered partition $A_1 | A_2 | \cdots | A_k$. As an example, Figure 2 depicts a 5-cycle network and the caption describes the ordered partition that it induces. We call the unique $k$-leaf $k$-cycle network topology the $k$-sunlet.
In Section 4.2, we will be concerned with $k$-cycle networks with $k \geq 4$. This motivates the following definition.
Definition 2.3.
The set of large-cycle networks is the collection of all $k$-cycle networks with $k \geq 4$.
The set of large-cycle networks is the focus of our main theorem, Theorem 1.1.
2.1. Obtaining a distribution from a phylogenetic network
In this section we describe how to obtain a probability distribution on $n$-tuples of DNA bases from an $n$-leaf phylogenetic network. As the network models we wish to discuss are a generalization of the nucleotide substitution models on phylogenetic trees, we begin by briefly reviewing these models.
2.1.1. Markov processes on phylogenetic trees
A phylogenetic model is a statistical model of molecular sequence evolution for a collection of taxa at a single DNA site. The tree parameter of such a model is an $n$-leaf rooted leaf-labeled tree $T$ where the leaf vertices are labeled by the taxa. The internal nodes of the tree represent ancestors of the taxa at the leaves. We denote the root of the tree by $\rho$ and associate to each node $v$ of $T$ a random variable $X_v$ with state space $\{A, C, G, T\}$, corresponding to the four DNA bases. The state of the random variable $X_v$ is meant to indicate the DNA base at the particular site being modeled in the taxon at $v$.
Let $\pi = (\pi_A, \pi_C, \pi_G, \pi_T)$ be the root distribution with $\pi_i = P(X_\rho = i)$, and associate to each edge $e = (u, v)$ of $T$ a $4 \times 4$ transition matrix $M^e$ where the rows and columns are indexed by the elements of the state space. Assuming $u$ is the vertex closer to the root, $M^e_{ij}$ is equal to the conditional probability $P(X_v = j \mid X_u = i)$. The entries of the transition matrices are called the stochastic parameters of the model. For a particular choice of parameters, the model returns a probability distribution on the set of $n$-tuples of DNA bases that may be observed at the leaves of $T$. To compute this distribution, we first consider an assignment $\phi$ of states to the vertices of $T$, where $\phi(v)$ is the state of $X_v$. Then the probability of observing the state $\phi$ can be computed using the root distribution and the transition matrices. Specifically, letting $\Sigma(T)$ be the set of edges of $T$, this probability is equal to
$$\pi_{\phi(\rho)} \prod_{e = (u, v) \in \Sigma(T)} M^e_{\phi(u), \phi(v)}.$$
Notice that this is a monomial in the stochastic parameters of the model. The probability of observing a particular state at the leaves can be obtained by marginalization, i.e. summing over all possible states of the internal nodes. Therefore, the distribution on all $4^n$ $n$-tuples of possible leaf states is given by a polynomial map from the stochastic parameter space $\Theta_T$ to the probability simplex
$$\psi_T : \Theta_T \to \Delta^{4^n - 1}.$$
The model described above is referred to as the general Markov model; other specific phylogenetic models can be obtained by restricting the stochastic parameters. For example, for the Jukes-Cantor model, all transition matrices are assumed to be of the form pictured in Figure 3. Because the rows of this matrix must sum to one, there is essentially a single parameter for each edge.
Once a particular substitution model and a tree $T$ are specified, the image of the parameterization map $\psi_T$ is called the model associated to $T$, denoted $\mathcal{M}_T$. The fact that $\psi_T$ is a polynomial map makes the model amenable to study with algebraic geometry.
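To make the polynomial parameterization concrete, the following Python sketch computes the leaf distribution of a Jukes-Cantor model on a 3-leaf claw tree by summing the monomial described above over the states of the single internal node (the root). The function names, the uniform root distribution, and the numerical edge parameters are assumptions chosen for illustration; they are not taken from the paper's computations.

```python
from itertools import product

STATES = "ACGT"

def jc_matrix(a):
    """Jukes-Cantor transition matrix: off-diagonal entries a, diagonal 1 - 3a."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]

def claw_distribution(edge_params, root=(0.25, 0.25, 0.25, 0.25)):
    """Leaf distribution of a claw tree with one Jukes-Cantor edge per leaf.

    For each leaf pattern, sums the monomial
        root[r] * prod_i M_i[r][x_i]
    over the root state r (the only internal node of a claw tree).
    """
    mats = [jc_matrix(a) for a in edge_params]
    dist = {}
    for pattern in product(range(4), repeat=len(mats)):
        total = 0.0
        for r in range(4):
            term = root[r]
            for m, x in zip(mats, pattern):
                term *= m[r][x]
            total += term
        dist["".join(STATES[x] for x in pattern)] = total
    return dist

# One (assumed) off-diagonal parameter per edge.
dist = claw_distribution([0.05, 0.10, 0.02])
```

Each entry of `dist` is a polynomial in the three edge parameters, and the entries sum to one because every row of a transition matrix sums to one.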
2.1.2. Markov processes on phylogenetic networks
Here we describe how to obtain a distribution on $n$-tuples of DNA bases from a phylogenetic network by taking a convex combination of the distributions from phylogenetic tree models. These network models are also described in [24, §3.3]. For this exposition, we assume that the network is a tree-child network [3], that is, we assume that the child of a reticulation vertex is always a tree vertex.
Let $\mathcal{N}$ be an $n$-leaf phylogenetic network and associate a transition matrix from a nucleotide substitution model to each edge of $\mathcal{N}$. Suppose $\mathcal{N}$ has $m$ reticulation vertices $v_1, \ldots, v_m$. Since each $v_i$ has indegree two, there are two edges, $e_i^0$ and $e_i^1$, directed into $v_i$. For $1 \leq i \leq m$, independently delete $e_i^0$ with probability $\delta_i$; otherwise, delete $e_i^1$. Intuitively, the parameter $\delta_i$ corresponds to the probability that a particular site was inherited along edge $e_i^1$. Encode this set of choices with a binary vector $\sigma \in \{0, 1\}^m$ where a $0$ in the $i$th coordinate indicates that edge $e_i^0$ was deleted. After deleting the $m$ edges, the result is a rooted $n$-leaf tree $T_\sigma$ with a set of transition matrices and corresponding probability distribution $P_{T_\sigma}$ on the leaf states. We can then define a distribution on $n$-tuples of DNA bases from the network as follows:
$$P_{\mathcal{N}} = \sum_{\sigma \in \{0, 1\}^m} \left( \prod_{i=1}^{m} \delta_i^{1 - \sigma_i} (1 - \delta_i)^{\sigma_i} \right) P_{T_\sigma}.$$
Notice that while the phylogenetic network model is a mixture model, it is not simply a $2^m$-tree phylogenetic mixture model. This is because in a phylogenetic mixture model, the entries of the transition matrices are chosen independently for each of the trees in the mixture. However, in the network model, the transition matrix parameters are chosen for the network edges and then inherited by the trees embedded in the network as pictured in Figure 4. It is still the case that the network model has a polynomial parameterization from the stochastic parameter space of the network to the probability simplex. For example, in the case we are considering in this paper, where $m = 1$, we can denote the tree obtained by deleting $e_1^0$ by $T_0$ and the tree obtained by deleting $e_1^1$ by $T_1$. Then the model of the network is the image of the polynomial map from the parameter space of the network to the probability simplex,
$$\psi_{\mathcal{N}} : \Theta_{\mathcal{N}} \to \Delta^{4^n - 1}, \qquad \psi_{\mathcal{N}} = \delta_1 \psi_{T_0} + (1 - \delta_1) \psi_{T_1},$$
where each $\psi_{T_\sigma}$ is evaluated at the transition matrices that $T_\sigma$ inherits from $\mathcal{N}$.
It is also worth noting that $T_0$ and $T_1$ may have the same topology while the network parameters associated to their edges differ. This is the case with the 3-cycle network depicted in Figure 4.
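The construction above can be sketched numerically for a network with a single reticulation vertex. In the Python code below, the network (vertices `r`, `u`, `w`, `h` and leaves `1`, `2`, `3`), the Jukes-Cantor edge parameters, the uniform root distribution, and all function names are hypothetical choices made for illustration. The point it is meant to show is that both embedded trees read their transition matrices from the same dictionary of network edges, which is the inheritance that separates the network model from an arbitrary 2-tree mixture.

```python
from itertools import product

def jc_matrix(a):
    """Jukes-Cantor transition matrix: off-diagonal entries a, diagonal 1 - 3a."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]

# Hypothetical rooted network: r -> u, r -> w, u -> 1, w -> 3, h -> 2,
# with reticulation edges u -> h and w -> h. One JC parameter per network edge.
PARAMS = {("r", "u"): 0.03, ("r", "w"): 0.04, ("u", "1"): 0.05, ("w", "3"): 0.06,
          ("u", "h"): 0.02, ("w", "h"): 0.07, ("h", "2"): 0.01}
MATS = {edge: jc_matrix(a) for edge, a in PARAMS.items()}
E0, E1 = ("u", "h"), ("w", "h")  # the two reticulation edges
LEAVES = ["1", "2", "3"]

def tree_distribution(edges, leaves):
    """Brute-force leaf distribution of a rooted tree with uniform root
    distribution, using the matrices inherited from the network (MATS)."""
    internal = sorted({v for e in edges for v in e} - set(leaves))
    dist = {}
    for pattern in product(range(4), repeat=len(leaves)):
        state = dict(zip(leaves, pattern))
        total = 0.0
        for hidden in product(range(4), repeat=len(internal)):
            state.update(zip(internal, hidden))
            term = 0.25  # uniform root distribution (an assumption)
            for (u, v) in edges:
                term *= MATS[(u, v)][state[u]][state[v]]
            total += term
        dist[pattern] = total
    return dist

def network_distribution(delta):
    """Delete E0 with probability delta, otherwise delete E1, and mix."""
    p0 = tree_distribution([e for e in MATS if e != E0], LEAVES)
    p1 = tree_distribution([e for e in MATS if e != E1], LEAVES)
    return {x: delta * p0[x] + (1 - delta) * p1[x] for x in p0}

p_net = network_distribution(0.3)
```

Setting `delta` to 0 or 1 recovers the distribution of a single embedded tree, so the tree models appear as boundary cases of this parameterization.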
Example 2.4.
The network in Figure 4 is a network with two reticulation edges and labeled with the transition matrices and respectively (in Figure 4, the ’s in the superscripts are suppressed for aesthetics). We delete and keep with probability , and delete and keep with probability . These two possibilities give rise to the two trees and in Figure 4. Notice how the transition matrices on and are inherited from . Thus, we can view the Markov model on as a 2-tree mixture model with additional algebraic relationships among the stochastic parameters.
The description we have given above works for any nucleotide substitution model. However, in a time-reversible model, the location of the root is unidentifiable [7]. Therefore, for a time-reversible model, we obtain the same distribution by computing the distribution of each embedded tree after unrooting that tree. In fact, we obtain the same distribution as from a network $\mathcal{N}$ if we instead define the model on the semi-directed network topology of $\mathcal{N}$. This implies that for a time-reversible model, any two networks that share the same semi-directed network topology, such as the two networks pictured in Figure 5, will yield the same distribution.
3. Algebraic statistics and generic identifiability
One of the insights of algebraic statistics is that many properties of phylogenetic models can be determined by ignoring the stochastic restrictions on the parameters and regarding the parameterization map $\psi_T$ of a tree model as a complex polynomial map. Thus, to answer many questions, it is often enough to consider only the Zariski closure of the image of $\psi_T$. This is a complex algebraic variety which we denote $V_T$. Likewise, in this paper, we will work with the Zariski closure of the image of the network parameterization map $\psi_{\mathcal{N}}$, the algebraic variety $V_{\mathcal{N}}$. Once we have made this change we refer to the formerly stochastic parameters of the model as the numerical parameters to distinguish them from the network parameter. Assuming there are $d$ stochastic parameters, we slightly abuse notation and write this new map as $\psi_{\mathcal{N}} : \mathbb{C}^d \to \mathbb{C}^{4^n}$.
An important question about any model is whether or not the parameters of the model are identifiable. For phylogenetic network models, the identifiability of the underlying network parameter is particularly important. If we are able to find a network and a choice of stochastic parameters that yield a distribution that matches our data, we would like to infer the history of the taxa under consideration from the network topology. To do this, we must ensure that the network we have found is the only such network for which it is possible to do so. More formally, the network topology of an $n$-leaf network model is identifiable if given any two distinct $n$-leaf networks $\mathcal{N}_1$ and $\mathcal{N}_2$, the intersection of their models is empty. This notion of identifiability tends to be too strong in practice, and instead, it is often only possible to prove generic identifiability.
Definition 3.1.
The network parameter of a phylogenetic network model is generically identifiable if given any two distinct $n$-leaf networks, $\mathcal{N}_1$ and $\mathcal{N}_2$, the set of parameters in $\Theta_{\mathcal{N}_1}$ that maps into $\mathcal{M}_{\mathcal{N}_2}$ is a set of Lebesgue measure zero.
In other words, the network parameter is generically identifiable, if, given a specific network, the distribution obtained from a generic choice of stochastic parameters could have only come from this network. To prove the generic identifiability of the network parameter of a phylogenetic network model, we will need to be able to distinguish networks.
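Definition 3.2.
Two distinct $n$-leaf networks $\mathcal{N}_1$ and $\mathcal{N}_2$ are distinguishable if $V_{\mathcal{N}_1} \cap V_{\mathcal{N}_2}$ is a proper subvariety of $V_{\mathcal{N}_1}$ and of $V_{\mathcal{N}_2}$. Otherwise, they are indistinguishable.
Though distinguishability is phrased in terms of varieties, it will often be easier to work with the vanishing ideal $I_{\mathcal{N}}$ of the network $\mathcal{N}$. The vanishing ideal is the set of polynomials that evaluate to zero everywhere on $V_{\mathcal{N}}$ (or equivalently on $\mathcal{M}_{\mathcal{N}}$). Ideals for $n$-leaf network models are contained in polynomial rings where the indeterminates are indexed by $n$-tuples of the DNA bases. That is, for a fixed choice of model and an $n$-leaf network $\mathcal{N}$,
$$I_{\mathcal{N}} = I(V_{\mathcal{N}}) \subseteq \mathbb{C}[\, p_{i_1 \cdots i_n} : i_1, \ldots, i_n \in \{A, C, G, T\} \,].$$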
The elements of the ideal of a phylogenetic model are referred to as phylogenetic invariants, and they have played a key role in proving identifiability results for several phylogenetic models (e.g., [2, 20, 4, 27, 19]). The following proposition shows the connection between generic identifiability and distinguishing networks.
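Proposition 3.3.
The network parameter of a phylogenetic network model is generically identifiable if, for all $n$, all pairs of $n$-leaf networks are distinguishable.
Proof.
Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be distinguishable $n$-leaf networks. By definition, this means that $V_{\mathcal{N}_1} \cap V_{\mathcal{N}_2}$ is a proper subvariety of $V_{\mathcal{N}_1}$. This implies that there exists a polynomial $f$ that vanishes on $V_{\mathcal{N}_2}$ but not on $V_{\mathcal{N}_1}$, so that
$$f \circ \psi_{\mathcal{N}_1}$$
is not identically zero. Therefore, the set of stochastic parameters in $\Theta_{\mathcal{N}_1}$ mapping into $\mathcal{M}_{\mathcal{N}_2}$ is contained in the proper algebraic subvariety
$$\{\, \theta : f(\psi_{\mathcal{N}_1}(\theta)) = 0 \,\}.$$
This implies that the set of stochastic parameters in $\Theta_{\mathcal{N}_1}$ mapping into $\mathcal{M}_{\mathcal{N}_2}$ must be measure zero inside of $\Theta_{\mathcal{N}_1}$. Otherwise, this subvariety would include all of $\Theta_{\mathcal{N}_1}$, and since the real numbers are Zariski dense, it would include all of the complex parameter space, a contradiction. ∎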
Remark.
While we define generic identifiability of the network parameter to be a condition on all pairs of network models from the class of $n$-leaf network models, we could easily modify the definition to be a condition on all pairs of network models from a subclass of the $n$-leaf network models. Furthermore, we can modify Proposition 3.3 to show the identifiability of the network parameter in a phylogenetic network model on a subclass of network models by showing all network models in the subclass are distinguishable.
The preceding remark will be important, since, as we will see in Section 4.1, arbitrary Jukes-Cantor networks are not distinguishable. We will even see that the same is true for the class of all cycle-networks, and we will have to restrict to the subclass of large-cycle networks to find a class of network models for which the semi-directed network topology is generically identifiable.
3.1. The Fourier-Hadamard Transform
Our approach for proving many of the results in this paper will be to use some of the computational algebra techniques outlined above. First, we will perform a linear change of coordinates called the Fourier-Hadamard transform [6, 34] that will make the parameterizations of the tree-based phylogenetic models monomial. This means that the cycle-network models will be parameterized by binomials, greatly reducing the computational time. Working in the transformed coordinates is common in phylogenetics, including applications involving mixture models [2, 20]. Since all of the computations referenced in the next section will be performed in Fourier coordinates, we provide here a basic explanation of the Fourier parameterization. More details can be found in [6, 34, 31].
The Fourier-Hadamard transform applies to a particular class of phylogenetic models called group-based models.
Definition 3.4.
A phylogenetic model is group-based if there exists a group $\mathcal{G}$, a map $L : \{A, C, G, T\} \to \mathcal{G}$, and functions $f^e : \mathcal{G} \to \mathbb{R}$ associated to the edges of $T$ such that, for each edge $e$ of $T$,
$$M^e_{ij} = f^e(L(i) - L(j)).$$
Our definition here is less general than is usually given and applies only to 4-state models of DNA evolution. However, the definition encompasses many commonly used models in phylogenetic applications including the Jukes-Cantor, the Kimura 2-parameter, and the Kimura 3-parameter models. The transformation applies to both the parameter space and the space of probability coordinates. To write the new parameterization, we let $\Sigma(T)$ represent the set of edges of $T$, and, for an edge $e$, we write $A_e | B_e$ for the split of the leaf labels induced by removing $e$ from $T$. We also use the symbols $A, C, G, T$ as shorthand for the group elements $L(A), L(C), L(G), L(T)$. The transformed parameters are written as $a^e_g$, and the new coordinates of the image space are parameterized as follows:
$$q_{g_1 g_2 \cdots g_n} = \begin{cases} \prod_{e \in \Sigma(T)} a^e_{\sum_{i \in A_e} g_i} & \text{if } g_1 + g_2 + \cdots + g_n = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Importantly, as was noted in [2], the linearity of the transform means that it applies to mixtures of tree-based phylogenetic models as well. As we have been careful to point out, the network models studied here are not the same as arbitrary 2-tree mixtures. However, we can obtain the transformed parameterization of a cycle-network model by identifying parameters in the Fourier parameterization of a 2-tree mixture.
For the Jukes-Cantor model, the group is $\mathbb{Z}_2 \times \mathbb{Z}_2$ and we arbitrarily identify the four bases with the four group elements, with $A$ as the identity, and then insist that for each edge $e$ of $T$, the edge function $f^e$ satisfies $f^e(C) = f^e(G) = f^e(T)$. This implies that the transformed parameters satisfy $a^e_C = a^e_G = a^e_T$, and the stochastic condition in the probability space forces $a^e_A = 1$. We will ignore this last condition, which effectively homogenizes the parameterization and allows us to work projectively. The example below shows how to obtain the parameterization for a 4-leaf 3-cycle network.
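As a numerical sanity check of the monomial Fourier parameterization, the Python sketch below applies the discrete Fourier transform over $\mathbb{Z}_2 \times \mathbb{Z}_2$ to the leaf distribution of a Jukes-Cantor model on a 3-leaf claw tree. On a claw tree each edge separates one leaf from the rest, so the transformed coordinate indexed by $(g_1, g_2, g_3)$ should equal the product of the three edge parameters when $g_1 + g_2 + g_3$ is the identity and vanish otherwise. The encoding of group elements as bit pairs, the uniform root distribution, the edge parameter values, and all function names are assumptions made for this illustration.

```python
import math
from itertools import product

# Elements of Z_2 x Z_2 as bit pairs, with the identity listed first.
GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]

def add(g, h):
    """Group operation (every element is its own inverse)."""
    return (g[0] ^ h[0], g[1] ^ h[1])

def chi(g, h):
    """Character of Z_2 x Z_2 indexed by g, evaluated at h."""
    return (-1) ** (g[0] * h[0] + g[1] * h[1])

def jc_edge_function(a):
    """Jukes-Cantor edge function: 1 - 3a at the identity, a elsewhere."""
    return {h: (1 - 3 * a if h == (0, 0) else a) for h in GROUP}

def fourier_parameters(f):
    """Transformed edge parameters: the Fourier transform of the edge function."""
    return {g: sum(chi(g, h) * f[h] for h in GROUP) for g in GROUP}

def claw_probabilities(fs):
    """Leaf distribution on a claw tree with uniform root distribution."""
    return {
        hs: sum(0.25 * math.prod(f[add(h, r)] for f, h in zip(fs, hs))
                for r in GROUP)
        for hs in product(GROUP, repeat=len(fs))
    }

def fourier_coordinates(probs):
    """Fourier transform of the joint leaf distribution."""
    n = len(next(iter(probs)))
    return {
        gs: sum(p * math.prod(chi(g, h) for g, h in zip(gs, hs))
                for hs, p in probs.items())
        for gs in product(GROUP, repeat=n)
    }

fs = [jc_edge_function(a) for a in (0.05, 0.10, 0.02)]
params = [fourier_parameters(f) for f in fs]
q = fourier_coordinates(claw_probabilities(fs))

def expected(gs):
    """Monomial prediction: product of edge parameters if the labels sum to 0."""
    total = (0, 0)
    for g in gs:
        total = add(total, g)
    if total != (0, 0):
        return 0.0
    return math.prod(a[g] for a, g in zip(params, gs))

max_error = max(abs(q[gs] - expected(gs)) for gs in q)
```

Here `max_error` measures the largest discrepancy between the transformed coordinates and the monomial prediction. The identity-indexed parameter of each edge equals the row sum of its transition matrix, which is why the stochastic condition fixes it to one.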
Example 3.5.
Shown below is the parameterization of a Fourier coordinate of the 4-leaf 3-cycle network pictured in Figure 6. To simplify the notation we use for the parameters rather than . The first term of the parameterization is the parameterization of the tree induced by removing the reticulation edge and the second from removing the edge .
[TABLE]
Notice, we can reparameterize this variety by replacing with and with . Thus, we can write
[TABLE]
4. Identifiability of cycle-networks
One of the techniques that has proven successful for establishing identifiability for phylogenetic mixture models is to first establish the result for mixtures on trees with a few leaves. The idea is then to show that distinct mixtures on trees with many leaves can always be restricted to a subset of the leaves on which they remain distinct. Our idea here is essentially the same: first prove some identifiability results for cycle-networks with few leaves and then show that these imply identifiability for cycle-networks with any number of leaves. To begin, we introduce the concept of restricting a phylogenetic network.
Definition 4.1.
Let $\mathcal{N}$ be an $n$-leaf phylogenetic network on $X$ with root $\rho$, and let $A \subseteq X$. The restriction of $\mathcal{N}$ to $A$ is the phylogenetic network $\mathcal{N}|_A$ constructed by
- (i) Taking the union of all directed paths from $\rho$ to a leaf labeled by an element of $A$.
- (ii) Deleting all vertices that lie above the last vertex that lies on all of these paths.
- (iii) Suppressing all degree two vertices other than the root.
- (iv) Removing all parallel edges.
- (v) Applying steps (iii) and (iv) until the network is a phylogenetic network.
The network constructed after step (i) of Definition 4.1 defines a network model which is the same as the phylogenetic network model $\mathcal{M}_{\mathcal{N}|_A}$ of the restriction $\mathcal{N}|_A$ of $\mathcal{N}$ to the leaf subset $A$. To see this, notice that it does not alter the model to delete pendant edges above the vertex described in (ii) and reroot at this vertex. This is because given any previous root distribution $\pi$ and pendant edge $e$, we can simply remove $e$ and choose the new root distribution in the model to be $\pi M^e$, treating $\pi$ as a row vector. Likewise, any non-root degree two vertex has two edges incident to it. Those edges can be replaced by a single edge with transition matrix that is the product of the transition matrices of the two incident edges. If either of the incident edges was a reticulation edge, then the new edge is also a reticulation edge and keeps the same reticulation edge parameter.
Finally, if there are two parallel edges $e$ and $e'$, they must be reticulation edges and can be replaced by a single edge with transition matrix $\delta M^{e} + (1 - \delta) M^{e'}$, where $\delta$ is the probability that a site is inherited along $e$. Thus, it is clear that the phylogenetic network model on the network constructed after step (i) of Definition 4.1 is contained in $\mathcal{M}_{\mathcal{N}|_A}$. The other containment is easily realized by setting some of the transition matrices to the identity in the network constructed after step (i) of Definition 4.1. The utility of the restriction operation comes from the following proposition.
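Proposition 4.2.
Let $\mathcal{N}$ be an $n$-leaf phylogenetic network on $X$ and $A \subseteq X$. Then $\mathcal{M}_{\mathcal{N}|_A}$ is the image of the model $\mathcal{M}_{\mathcal{N}}$ under the marginalization map defined by marginalizing over all the states of the leaves labeled by elements of $X \setminus A$.
Proof.
Write $\mu_A$ for the marginalization map in the statement. Just as described for a tree in Section 2.1.1, given an assignment of states to the vertices of $\mathcal{N}$, we can compute the probability of observing this state using the root distribution, the transition matrices, and the reticulation edge parameters. Let $\theta$ be a choice of parameters for the model on $\mathcal{N}$. The distribution $\psi_{\mathcal{N}}(\theta)$ can then be computed by marginalizing over all the states of the non-leaf vertices. The network constructed after step (i) of Definition 4.1 defines a distribution which is computed by further marginalizing over all states of the leaves not labeled by elements of $A$. As we argued above, this distribution is contained in $\mathcal{M}_{\mathcal{N}|_A}$ and is precisely the image of $\psi_{\mathcal{N}}(\theta)$ under $\mu_A$. Therefore $\mu_A(\mathcal{M}_{\mathcal{N}}) \subseteq \mathcal{M}_{\mathcal{N}|_A}$.
Choosing any parameters yielding a distribution $p \in \mathcal{M}_{\mathcal{N}|_A}$, we can choose matching parameters for the edges shared by $\mathcal{N}|_A$ and $\mathcal{N}$ and extend this with any choice of parameters $\theta$ to the rest of $\mathcal{N}$. Then $\mu_A(\psi_{\mathcal{N}}(\theta)) = p$, which implies that $\mathcal{M}_{\mathcal{N}|_A} \subseteq \mu_A(\mathcal{M}_{\mathcal{N}})$. ∎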
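The first containment in the proof can be illustrated numerically. The Python sketch below takes a hypothetical rooted 3-leaf network with one reticulation vertex, computes its Jukes-Cantor distribution, marginalizes over the states of leaf `3`, and compares the result with the distribution defined on the network left after step (i) of Definition 4.1 for the leaf subset `{1, 2}`, using the same edge parameters. The network, the parameter values, the uniform root distribution, and the function names are all assumptions made for this example.

```python
from itertools import product

def jc_matrix(a):
    """Jukes-Cantor transition matrix: off-diagonal entries a, diagonal 1 - 3a."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]

# Hypothetical network: r -> u, r -> w, u -> 1, w -> 3, h -> 2,
# with reticulation edges E0 = u -> h and E1 = w -> h.
PARAMS = {("r", "u"): 0.03, ("r", "w"): 0.04, ("u", "1"): 0.05, ("w", "3"): 0.06,
          ("u", "h"): 0.02, ("w", "h"): 0.07, ("h", "2"): 0.01}
MATS = {edge: jc_matrix(a) for edge, a in PARAMS.items()}
E0, E1 = ("u", "h"), ("w", "h")

def tree_distribution(edges, leaves):
    """Brute-force leaf distribution with uniform root distribution."""
    internal = sorted({v for e in edges for v in e} - set(leaves))
    dist = {}
    for pattern in product(range(4), repeat=len(leaves)):
        state = dict(zip(leaves, pattern))
        total = 0.0
        for hidden in product(range(4), repeat=len(internal)):
            state.update(zip(internal, hidden))
            term = 0.25
            for (u, v) in edges:
                term *= MATS[(u, v)][state[u]][state[v]]
            total += term
        dist[pattern] = total
    return dist

def mixture(edges, leaves, delta):
    """Delete E0 with probability delta, otherwise delete E1, and mix."""
    p0 = tree_distribution([e for e in edges if e != E0], leaves)
    p1 = tree_distribution([e for e in edges if e != E1], leaves)
    return {x: delta * p0[x] + (1 - delta) * p1[x] for x in p0}

# Distribution of the full network, then marginalize over leaf "3".
full = mixture(list(MATS), ["1", "2", "3"], delta=0.3)
marginal = {}
for (x1, x2, x3), p in full.items():
    marginal[(x1, x2)] = marginal.get((x1, x2), 0.0) + p

# Network after step (i) for A = {1, 2}: drop the edge leading only to leaf "3".
step_one_edges = [e for e in MATS if e != ("w", "3")]
restricted = mixture(step_one_edges, ["1", "2"], delta=0.3)

max_gap = max(abs(marginal[x] - restricted[x]) for x in restricted)
```

The two distributions agree up to floating-point error because summing a transition matrix row over the states of a deleted leaf gives one. The sketch stops at step (i) and does not carry out the suppression steps (ii) through (v).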
In Section 3 we argued that to prove identifiability for a class of -leaf networks, it is enough to show that any two networks in the class are distinguishable. One of the nice applications of Proposition 4.2 is the following result.
Proposition 4.3.
Let $\mathcal{N}_1$ and $\mathcal{N}_2$ be distinct $n$-leaf networks on $X$ and let $A \subseteq X$. If $\mathcal{N}_1|_A$ and $\mathcal{N}_2|_A$ are distinguishable, then so are $\mathcal{N}_1$ and $\mathcal{N}_2$.
Proof.
Suppose $\mathcal{N}_1|_A$ and $\mathcal{N}_2|_A$ are distinguishable. Without loss of generality, we will show $V_{\mathcal{N}_1} \cap V_{\mathcal{N}_2}$ is a proper subvariety of $V_{\mathcal{N}_1}$. By definition of distinguishable, we have $V_{\mathcal{N}_1|_A} \not\subseteq V_{\mathcal{N}_2|_A}$. Therefore, there exists a polynomial $f$ that vanishes on $V_{\mathcal{N}_2|_A}$ but not on $V_{\mathcal{N}_1|_A}$. Letting $\phi$ be the ring homomorphism corresponding to the marginalization map of Proposition 4.2, the polynomial $\phi(f)$ vanishes on $V_{\mathcal{N}_2}$ but not on $V_{\mathcal{N}_1}$. Therefore, $V_{\mathcal{N}_1} \cap V_{\mathcal{N}_2}$ is a proper subvariety of $V_{\mathcal{N}_1}$. ∎
It is possible that after restricting, or indeed even after unrooting, a cycle-network becomes a 2-cycle network. Such a network must necessarily have parallel edges, which, as pointed out in the discussion preceding Proposition 4.2, can be suppressed without altering the network model. This implies immediately that the models for 2-cycle networks are phylogenetic tree models. This suggests that perhaps we should exclude 2-cycle networks entirely from the class of networks considered in order to preserve a statement about the generic identifiability of the network topology. However, even this will not be enough.
Proposition 4.4.
For the CFN, JC, K2P, and K3P models, the ideal for any 3-leaf 3-cycle network is the zero ideal.
Proof.
Up to relabeling, there is only one 3-leaf 3-cycle network topology (Figure 7), and by computation, we can verify the ideal of this network is trivial for CFN, JC, K2P, and K3P. ∎
Remark.
For the computations in Proposition 4.4 and all other computations referenced in this paper, we work modulo the set of linear invariants that hold for every $n$-leaf network for the group-based model specified. All computations are performed in Macaulay2 [11] and are available in the supplementary materials on the authors’ websites.
Therefore, in order to find a class of network models for which we can establish identifiability results, we need to start by considering at least 4-leaf networks. For the 3-leaf 3-cycle network, if we restrict the parameter space by setting each of the leaf transition matrices to the identity matrix, the corresponding variety still fills the ambient space. This fact will later prove important when investigating which 4-leaf network ideals are contained in one another.
4.1. Distinguishing 4-leaf Networks
After unrooting and removing parallel edges, there are, up to relabeling, only four semi-directed 4-leaf cycle-network topologies. One of these is the 4-leaf unrooted tree itself; the other three are pictured in Figure 8.
Therefore, up to an action of on the leaf labels, there are at most four different 4-leaf cycle-network ideals. In fact, there are exactly three for the Jukes-Cantor model.
Proposition 4.5.
The Jukes-Cantor network ideals for the two 4-leaf 3-cycle networks labeled as in Figure 8 are equal.
Proof.
The parameterization for the variety of in the Fourier coordinates is given by
[TABLE]
for each . The term in parentheses is exactly the parameterization of the Fourier coordinate for the variety of the 3-leaf 3-cycle network we obtain by pruning off the leaves and from . Likewise, letting be the Fourier parameters for , we have
[TABLE]
Again, the term in parentheses is the parameterization of for the variety of the 3-leaf 3-cycle network we obtain by pruning the leaves and from .
Without loss of generality, specify the to obtain a point in . Since there are no invariants for any 3-leaf 3-cycle network, for a generic choice of parameters, we can choose the for so that
[TABLE]
for all . Further choosing and shows that this point is also in . Since a generic choice of parameters for must also map into , it must be that . ∎
Remark.
Proposition 4.5 can be proven more succinctly using the toric fiber product. The toric fiber product [32] is a procedure that takes two homogeneous ideals (not necessarily toric) in rings with a compatible grading and produces a new homogeneous ideal. For phylogenetic tree models, it has been shown that the toric fiber product can be used to construct the ideal associated to a phylogenetic tree by “gluing” together the ideals associated to claw trees. Though we do not develop the full machinery here, the details for cycle-networks closely parallel the situation described for trees in [32, Section 3.4]. In Proposition 4.5, and can both be constructed by gluing a 3-leaf claw tree to the 3-sunlet along the edges and respectively. The proof of Proposition 4.5 then follows immediately since both network ideals are equal to the toric fiber product of the zero ideal and the ideal for the 3-leaf claw tree.
There are ways to label each of the three -leaf cycle-network topologies. However, many of these labelings result in the same ideal. For example, swapping the labels and in the -cycle network in Figure 8 does not change the network. The following proposition classifies all the ideals associated to 4-leaf Jukes-Cantor cycle-networks.
Proposition 4.6.
For 4-leaf Jukes-Cantor cycle-networks, there are
- 3 ideals corresponding to 2-cycle networks (trees) that are 6-dimensional.
- 6 ideals corresponding to 3-cycle networks that are 7-dimensional.
- 12 ideals corresponding to 4-cycle networks that are 8-dimensional.
Notice that this situation is in sharp contrast to the case of Jukes-Cantor mixture models, and, indeed, all other group-based mixture models, where equality between the number of leaves implies equality between the ideal dimensions [2, 20, 12].
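The counts 3, 6, and 12 in Proposition 4.6 can be recovered by a small enumeration, sketched below under assumptions that are ours rather than the paper's: a quartet tree is determined by its split, a 4-leaf 3-cycle network ideal depends only on which pair of leaves forms the cherry, and a 4-leaf 4-cycle network is determined by its reticulation leaf and the cyclic order of the remaining leaves up to reflection. The function names are hypothetical. The sketch only counts labeled classes; that distinct classes give distinct ideals is the computational content of the proposition.

```python
from itertools import permutations

LEAVES = (1, 2, 3, 4)


def tree_key(p):
    # Quartet tree with cherries {p[0], p[1]} and {p[2], p[3]}:
    # only the unordered split matters.
    return frozenset({frozenset(p[:2]), frozenset(p[2:])})


def three_cycle_key(p):
    # Assumption: the ideal depends only on the pair forming the cherry.
    return frozenset(p[:2])


def four_cycle_key(p):
    # p[0] is the reticulation leaf; p[1], p[2], p[3] follow around the cycle.
    # Reflecting the cycle fixes p[0] and p[2] and swaps p[1] with p[3].
    return (p[0], p[2], frozenset({p[1], p[3]}))


def count_classes(key):
    # Number of distinct canonical forms over all labelings of the four leaves.
    return len({key(p) for p in permutations(LEAVES)})


counts = {
    "tree": count_classes(tree_key),
    "3-cycle": count_classes(three_cycle_key),
    "4-cycle": count_classes(four_cycle_key),
}
print(counts)
```

Under these assumptions the enumeration returns 3, 6, and 12 classes respectively, matching the number of ideals listed in the proposition.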
The dimension results in Proposition 4.6 are obtained by computing the ideals in Macaulay2 [11]; the computations are available in the supplementary materials. Our approach for each ideal is to first obtain a set of elements of the ideal by computing it only up to a certain degree. We then use the rank of the Jacobian matrix to construct a lower bound on the dimension of the ideal. Finally, we verify that the elements found in low degree generate a prime ideal of the correct dimension, and hence, form a generating set.
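To illustrate the Jacobian rank step, the following sketch evaluates the Jacobian of a parameterization at a random rational point and computes its rank exactly. It is not the authors' Macaulay2 code. As a stand-in it uses the 4-leaf Jukes-Cantor tree with split 12|34 in Fourier coordinates, assuming the standard group-based parameterization in which each edge contributes one parameter for the identity of Z/2 x Z/2 and a second parameter shared by the three non-identity elements. All names are hypothetical.

```python
from fractions import Fraction
from itertools import product
import random

GROUP = [(0, 0), (0, 1), (1, 0), (1, 1)]  # elements of Z/2 x Z/2


def add(g, h):
    return ((g[0] + h[0]) % 2, (g[1] + h[1]) % 2)


def edge_labels(g1, g2, g3):
    # Leaf labels sum to the identity, so the fourth label is determined.
    g4 = add(add(g1, g2), g3)
    # Four leaf edges, then the internal edge of the tree 12|34.
    return [g1, g2, g3, g4, add(g1, g2)]


def jacobian(point):
    # point[e] = [a0, a1]: Jukes-Cantor Fourier parameters of edge e.
    rows = []
    for g1, g2, g3 in product(GROUP, repeat=3):
        idx = [0 if g == (0, 0) else 1 for g in edge_labels(g1, g2, g3)]
        q = Fraction(1)
        for e, k in enumerate(idx):
            q *= point[e][k]
        row = [Fraction(0)] * 10
        for e, k in enumerate(idx):
            # Each parameter occurs at most once in the monomial q.
            row[2 * e + k] = q / point[e][k]
        rows.append(row)
    return rows


def rank(rows):
    # Exact Gaussian elimination over the rationals.
    m = [r[:] for r in rows]
    rk = 0
    for c in range(len(m[0])):
        piv = next((i for i in range(rk, len(m)) if m[i][c] != 0), None)
        if piv is None:
            continue
        m[rk], m[piv] = m[piv], m[rk]
        for i in range(rk + 1, len(m)):
            if m[i][c] != 0:
                f = m[i][c] / m[rk][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[rk])]
        rk += 1
    return rk


rng = random.Random(0)
point = [[Fraction(rng.randint(1, 97)) for _ in range(2)] for _ in range(5)]
jc_tree_rank = rank(jacobian(point))
print(jc_tree_rank)
```

For this toy tree the rank is 6, which is consistent with the 6-dimensional tree ideals listed in Proposition 4.6. For the network models the parameterization is no longer monomial, so the same idea would need the network's own parameterization in place of `jacobian`.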
We include at this point a classification of and -cycle network ideals for the CFN model. This proposition suggests that it may be difficult or impossible to obtain strong generic identifiability results for CFN networks and provides motivation for beginning with the Jukes-Cantor model.
Proposition 4.7.
For 4-leaf CFN cycle-networks, there are
- 3 ideals corresponding to 2-cycle (trees) and 3-cycle networks that are 6-dimensional.
- 3 ideals corresponding to 4-cycle networks that are 7-dimensional.
Returning again to Jukes-Cantor networks, we have the following corollaries to Proposition 4.6.
Corollary 4.8.
Let be a -cycle network and be a -cycle network. If , then and
Corollary 4.9.
Let and be distinct -leaf -cycle networks. Then , , , and
The network ideals described in Proposition 4.6 of the same dimension differ only by a permutation of the coordinates. Since each network ideal is parameterized, the ideal can be written as the kernel of a homomorphism, and, consequently, it is prime. If an ideal contains a prime ideal of the same dimension, then the two ideals are equal. Therefore, the network ideals of the same dimension are either equal or distinguishable.
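The step from containment to equality can be spelled out as follows. This is a sketch in notation introduced here only for illustration: $I_1$ and $I_2$ stand for two network ideals and $V(\cdot)$ for the corresponding variety.

```latex
\[
\begin{aligned}
& I_1 \subseteq I_2, \quad I_1, I_2 \text{ prime}, \quad \dim V(I_1) = \dim V(I_2) \\
& \implies V(I_2) \subseteq V(I_1), \text{ with both varieties irreducible of equal dimension} \\
& \implies V(I_2) = V(I_1)
  \quad \text{(a proper closed subset of an irreducible variety has smaller dimension)} \\
& \implies I_1 = I_2 .
\end{aligned}
\]
```

Consequently, if two network ideals of the same dimension are not equal, neither contains the other, which is what is needed for the two networks to be distinguishable.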
The poset pictured in Figure 9 shows the containment relationships among the 21 equivalence classes of 4-leaf cycle-network ideals. It is possible to verify the containment of the 3-cycle network varieties inside the 4-cycle network varieties by showing the reverse inclusion of the ideals by computation. However, we can also see this from the structure of the networks themselves.
Example 4.10.
Choose each of the Fourier parameters associated to the edge to be equal to 1 in the 4-cycle network from Figure 8. This essentially collapses the edge in the network to produce a 4-leaf 3-cycle network. We can construct this new network by attaching a 1-2 cherry to a 3-leaf 3-cycle network with a single leaf edge removed. As noted after Proposition 4.4, there are no invariants for the 3-leaf 3-cycle network even with all leaf edges removed. Therefore, the same arguments from the proof of Proposition 4.5 show that the variety for this network is equal to both and . Therefore, both of these network varieties are contained in . Collapsing the other solid edge in the cycle of the 4-cycle network shows that also contains the variety of any 4-leaf 3-cycle network with a 2-3 cherry.
4.2. Distinguishing Large-Cycle Networks
In the previous section, we showed that it is possible to distinguish some cycle-networks with only a few leaves from one another by computing the ideals for these networks explicitly. In this section, we collect the results needed to prove Theorem 1.1. That is, we will show that if is an -leaf -cycle network and is a distinct -leaf -cycle network with , then .
The three lemmas below address the three cases, where , , and . In each of the lemmas we will assume that is a -cycle network and is a -cycle network. We assume that the cycle vertices of and are labeled according to the convention described in Section 2 so that the induced partition of is in and in . The goal in each case of each of the lemmas below will be to find such that , which by Proposition 4.3, implies that . One result that we will use repeatedly is the generic identifiability of the tree topology of a Jukes-Cantor tree model. This is a well-known result with multiple independent proofs [1, 30].
Lemma 4.11.
Let and be two distinct -cycle networks with . Then and .
Proof.
Case 1:
Since there exist with and such that while and . Let contain and two additional leaf labels so that is a 4-leaf 4-cycle network. Since , is either a 2 or 3-cycle network. In either case, Corollary 4.8 implies that . By a similar argument, we can show that .
Case 2: .
If , then we can assume that if then , else the desired result follows from the result for trees. Furthermore, we can assume that there exists an such that (else ). Thus, is simply a reordering of , i.e. . Without loss of generality, we can view each network as a -sunlet network, with the pendant edges of labeled starting from the reticulation vertex and proceeding clockwise and with the pendant edges of labeled starting from the reticulation vertex and proceeding clockwise.
Assume and let where are two additional leaves with . Then, since they do not have the same reticulation vertex, and are distinct 4-leaf 4-cycle networks, and so by Corollary 4.9, and . If , then choose to be . Then is a caterpillar tree with cherries labeled by and and is a caterpillar tree with cherries labeled by and . If these trees are not identical, then again, results for single tree models imply and . If these trees are identical, then since it must be that for either or for , and are distinct 4-leaf 4-cycle networks, and the result follows again by Corollary 4.9.
∎
Lemma 4.12.
Let be a -cycle network and be a -cycle network with . Then .
Proof.
We may assume , since otherwise the result follows from Corollary 4.8. Since , there exist with and such that while and . As in Lemma 4.11 let contain and two additional leaf labels so that is a 4-leaf 4-cycle network. Again, must be either a tree or a -cycle network and the result follows by Corollary 4.8 and Proposition 4.3. ∎
Lemma 4.13.
Let be a -cycle network and be a -cycle network with . Then .
Proof.
By similar arguments to those in the proof of Lemma 4.11, if is not a refinement of , then . Thus, we will assume that is a refinement of .
Since refines , there exist such that . Construct the set consisting of , , and any three other leaf labels so that is a 5-leaf 4-cycle network. Now by construction, must be a 5-leaf 5-cycle network. Thus, up to relabeling, we can assume that is the 5-sunlet network with the leaf labeled by 1 attached to the reticulation vertex and all other leaves labeled in consecutive order around the sunlet. could now be any one of several 5-leaf 4-cycle networks. However, if it is any network other than one of the two pictured in Figure 10, then there exists a 4-element subset such that either and are distinct trees or is a 3-cycle network and is a tree. In either event, this would imply and the result follows. Thus, we may assume that is one of the two networks pictured in Figure 10.
Representing , , , and by [math], , , and , the following cubic is in the 5-sunlet network ideal
[TABLE]
Substituting the parameterization for each of the two 4-cycle networks pictured into this polynomial, we find that it must vanish on but not on . Thus, . ∎
Finally, we are able to give the proof of the main theorem.
Proof of Theorem 1.1.
By Proposition 3.3, the network parameter of a phylogenetic network model is generically identifiable if for all , all pairs of -leaf networks are distinguishable. So let be an -leaf -cycle network and be an -leaf -cycle network with . By application of one of Lemmas 4.11, 4.12, or 4.13, and are distinguishable. ∎
5. Discussion and Open Problems
We have shown that the semi-directed network topology of a Jukes-Cantor network is not necessarily identifiable even when restricting to networks with a single reticulation vertex. In fact, we need to further restrict to the class of large-cycle networks in order for the semi-directed network topology to be generically identifiable. While this identifiability result covers a large subset of cycle-networks, models on networks with small cycles may be of biological interest, and thus, exploration of how to use these models effectively is required.
Furthermore, this paper introduces a collection of algebraic varieties worth deeper investigation. The varieties in this paper are subvarieties of the join varieties associated to 2-tree mixture models, but, as we have seen, the class of network model varieties has different properties than the class of 2-tree mixture varieties. Thus, there remain a number of interesting mathematical questions to address. For example, the results of Proposition 4.6 might suggest that for networks with -leaves, the dimension of the network model increases with cycle size. However, this has not been proven, and indeed, we propose the following conjecture to the contrary.
Conjecture 5.1.
Let and be two -leaf large-cycle networks, then .
While we are unable to compute the full ideals, the rank of the Jacobian matrix evaluated at a random point for the 5-leaf -cycle and -sunlet networks is the same, suggesting that both ideals have the same dimension. Observe also that the argument from Example 4.10 does not apply to 5-sunlet networks. This is because when we collapse one of the cycle edges of the 5-sunlet network, the resulting network is not binary. This was also the case when we collapsed an edge in the 4-cycle network in Example 4.10. The resulting variety was equivalent to the variety of a 3-cycle binary network only because the variety of the 3-leaf 3-cycle network with all leaf edges collapsed fills the entire space. Thus, it would be interesting to determine if this dimension phenomenon is isolated to those networks with small cycle size.
The next question we pose comes from the logical step of extending this work to other models.
Question 5.2.
Is the semi-directed network topology parameter generically identifiable for large-cycle Kimura 2-parameter (Kimura 3-parameter) network models?
Many of the same techniques should prove fruitful for these models. The combinatorial arguments used in this paper will apply, though the number of parameters and the size of the rings may make the computational steps much more difficult. Still, since K2P and K3P are group-based, the Fourier transform applies and finding the necessary invariants is at least within the realm of possibility. All of this, of course, addresses only the identifiability of the semi-directed network topology. It would also be of practical interest to determine the identifiability of the transition matrix parameters.
The identifiability results we obtain in this paper suggest that there could be more identifiability issues as we increase the number of reticulation vertices in the network. For this reason, applying a similar approach to more general classes of networks would be of great interest. One of the key tools in this paper is Proposition 4.3, which allows us to prove the identifiability of networks with any number of leaves by considering only networks with fewer than five leaves in Section 4.2. However, it has already been shown that arbitrary networks cannot be identified by their subnetworks [13], so this approach has little hope of succeeding in that case. Instead, the next step might be to examine tree-child networks, which are identifiable from their trinets, the induced subnetworks on three leaves [15]. There are some subtleties involved here as well, as the result for trinets applies to the rooted network topology and we have already seen that there will be indistinguishable 3-leaf tree-child networks. Still, it may be possible to make similar arguments by restricting to -leaf subnetworks for some fixed . Finally, it may be worthwhile to start by examining slightly more general classes of networks, such as level-1 networks with only two reticulation vertices.
6. Acknowledgements
We would like to thank Seth Sullivant for his insights regarding the toric fiber product and cycle-network ideals. Colby Long is supported by the Mathematical Biosciences Institute and the National Science Foundation under grant DMS-1440386. Elizabeth Gross is supported by the National Science Foundation under grant DMS-1620109.
References
- [1] Reconstruction of evolutionary trees from pairwise distributions on current species. In E.M. Keramidas, editor, Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 254–257, Fairfax Station, VA, 1991. Interface Foundation.
- [2] Elizabeth S. Allman, Sonja Petrović, John A. Rhodes, and Seth Sullivant. Identifiability of 2-tree mixtures for group-based models. IEEE/ACM Trans. Comp. Biol. Bioinformatics, 8(3):710–722, 2011.
- [3] Gabriel Cardona, Francesc Rosseló, and Gabriel Valiente. Comparison of tree-child phylogenetic networks. IEEE/ACM Trans. Comp. Biol. Bioinformatics, 6:552–569, 2007.
- [4] J. Chifman and L. Kubatko. Identifiability of the unrooted species tree topology under the coalescent model with time specific rate variation and invariable sites. Journal of Theoretical Biology, 374:35–47, 2015.
- [5] Charles Choy, Jesper Jansson, Kunihiko Sadakane, and Wing-Kin Sung. Computing the maximum agreement of phylogenetic networks. Electronic Notes in Theoretical Computer Science, 91:134–147, 2004.
- [6] S.N. Evans and T.P. Speed. Invariants of some probability models used in phylogenetic inference. Ann. Statist., 21(1):355–377, 1993.
- [7] Joseph Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.
- [8] Andrew Francis, Charles Semple, and Mike Steel. New characterisations of tree-based networks and proximity measures. arXiv:1611.04225.
