PhyloGFN: Phylogenetic inference with generative flow networks

Mingyang Zhou; Zichao Yan; Elliot Layne; Nikolay Malkin; Dinghuai Zhang; Moksh Jain; Mathieu Blanchette; Yoshua Bengio

arXiv:2310.08774·q-bio.PE·March 26, 2024·2 cites

TL;DR

PhyloGFN applies generative flow networks to phylogenetic inference, sampling diverse, high-quality evolutionary hypotheses from a complex tree space. It is competitive with existing methods in marginal likelihood estimation and fits the target distribution more closely than variational methods.

Contribution

This paper presents PhyloGFN, the first application of GFlowNets to phylogenetics, enabling efficient sampling from complex posterior distributions over tree topologies.

Findings

01 Produces diverse, high-quality evolutionary hypotheses

02 Competitive in marginal likelihood estimation

03 Achieves closer fit to target distribution than variational methods

Abstract

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary…

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- *Innovative Application*: Utilizing GFlowNets for phylogenetic inference is a novel idea, showcasing the versatility of GFlowNets in unique problem settings.
- *Competitive Performance*: The approach not only matches up to VBPI-GNN-based methods in Bayesian posterior inference but offers capabilities beyond them, like generating from arbitrarily filled-in subtrees which preceding methods could not tackle.
- *Estimating probability of suboptimal structures*: The ability to outperform VBPI metho

Weaknesses

- *Performance vs. Efficiency*: While GFlowNets might perform comparably to PAUP* in the parsimony-based inference setting, the real differentiator would be computational efficiency on a new inference task. Unfortunately, no wall-clock time data is provided, making it challenging to discern any advantages of GFlowNets in this scenario.
- *Methodological Novelty*: The paper does not seem to bring forth significant machine learning methodological advancements, with much of the methodology being st

Reviewer 02 · Rating 8 (accept, good paper) · Confidence 4

Strengths

* The paper is clear and well-written.
* The approach is interesting, conceptually simple, and provides a nice, efficient way to generate distributions over trees that support all of tree space (as opposed to relying on a pre-defined subset of tree space like VBPI).
* The theoretical results and connections to Felsenstein's and Fitch's algorithms are really nice.
* The performance across the full space of trees is promising.
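For readers unfamiliar with Fitch's algorithm, which this review mentions in connection with the parsimony setting, the following is a minimal sketch of the textbook procedure for one rooted binary tree. It is not code from the PhyloGFN repository: the nested-tuple tree encoding, the function names, and the toy alignment are all assumptions made for illustration.

```python
def fitch_site_changes(tree, leaf_states):
    """Minimum number of state changes for one alignment site.

    tree: nested 2-tuples of leaf names, e.g. (("A", "B"), ("C", "D")).
    leaf_states: dict mapping leaf name -> character observed at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is the observed character
            return {leaf_states[node]}
        left, right = node
        left_set, right_set = visit(left), visit(right)
        common = left_set & right_set
        if common:                         # children agree: keep the intersection
            return common
        changes += 1                       # children disagree: one mutation, keep the union
        return left_set | right_set

    visit(tree)
    return changes


def parsimony_score(tree, alignment):
    """Sum the per-site Fitch counts over all columns of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_site_changes(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical two-site alignment for four species.
alignment = {"A": "AC", "B": "AC", "C": "GC", "D": "GT"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree_1, alignment), parsimony_score(tree_2, alignment))
```

On this toy alignment the first topology needs fewer changes than the second, so parsimony-based inference would prefer it.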

Weaknesses

* I know that it is common in the Bayesian Phylogenetics field, but I am uncomfortable with using the marginal log-likelihood (MLL) as a measure of the accuracy of the posterior. As the authors note, taking $K \to \infty$ in equation (5) or the log of (S1) results in the true MLL regardless of the distribution used. For finite $K$, both the bias and variance of the estimated MLL will depend on the learned posterior and it is incredibly difficult for me to compare methods. For example in Table
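The reviewer's concern can be illustrated numerically. Equation (5) is not reproduced on this page, so the sketch below assumes it is the usual $K$-sample importance-weighted estimate, $\log \frac{1}{K}\sum_{k} p(x, z_k)/q(z_k)$ with $z_k \sim q$. The three-state latent space, the joint probabilities, and both proposal distributions are hypothetical values chosen so that the true MLL is known exactly.

```python
import numpy as np


def mll_estimate(log_joint, log_q, samples):
    """K-sample importance-weighted estimate of log p(x), computed stably."""
    log_w = log_joint[samples] - log_q[samples]   # log importance weights
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))


def replicate(q, K, reps, rng, log_joint):
    """Repeat the K-sample estimate `reps` times with proposal q."""
    log_q = np.log(q)
    return np.array([
        mll_estimate(log_joint, log_q, rng.choice(len(q), size=K, p=q))
        for _ in range(reps)
    ])


rng = np.random.default_rng(0)

# Hypothetical joint p(x, z) over three latent "trees"; the true MLL is log of its sum.
joint = np.array([0.02, 0.10, 0.08])
log_joint = np.log(joint)
true_mll = np.log(joint.sum())

exact_q = joint / joint.sum()          # proposal equal to the true posterior
poor_q = np.array([0.8, 0.1, 0.1])     # proposal that over-samples the unlikely state

small_k_exact = replicate(exact_q, K=2, reps=2000, rng=rng, log_joint=log_joint)
small_k_poor = replicate(poor_q, K=2, reps=2000, rng=rng, log_joint=log_joint)
large_k_poor = replicate(poor_q, K=100_000, reps=1, rng=rng, log_joint=log_joint)

print("true MLL:            ", true_mll)
print("K=2, exact proposal: ", small_k_exact.mean(), "+/-", small_k_exact.std())
print("K=2, poor proposal:  ", small_k_poor.mean(), "+/-", small_k_poor.std())
print("K=1e5, poor proposal:", large_k_poor[0])
```

With the exact posterior as proposal every importance weight equals $p(x)$, so the estimate is exact at any $K$. With the poorer proposal the small-$K$ average falls below the true value (Jensen's inequality) and varies widely, while a very large $K$ recovers the true MLL. At finite $K$ the reported number therefore mixes the quality of the learned posterior with estimator bias and variance.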

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

* Both the proposed model architecture and the use of GFlowNets for the chosen application are novel and interesting.
* The paper conducts a large body of well-planned experiments, compares against a properly chosen set of baselines and reports results favorable to the proposed method.
* The fact that the comparison is not made against many alternative methods is understandable, as there probably are not many modern machine learning methods addressing the same problem.
* The design choices

Weaknesses

* The biggest weakness looks to me like the results in Table 1. The log-likelihood scores look very similar to each other. For instance -7108.95 for PhyloGFN and 7108.95 VBPI-GNN. Similar for other data sets.
* It is great that the paper makes lots of effort to ensure the reproducibility of the results. However, it looks to me like the main paper lacks some essential details about the experiments, making it a bit hard for the reader to evaluate the results. See my question below.
* Likewise

Code & Models

Repositories

zmy1116/phylogfn
PyTorch · Official

Videos

PhyloGFN: Phylogenetic inference with generative flow networks · slideslive

Taxonomy

Topics: Genomics and Phylogenetic Studies · Genetic diversity and population structure · Biomedical Text Mining and Ontologies

Methods: Variational Inference