Impossibility of phylogeny reconstruction from $k$-mer counts
Wai-Tong Louis Fan, Brandon Legried, Sebastien Roch

TL;DR
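This paper proves that, under a two-state model of sequence evolution, $k$-mer counts over the full leaf sequences are insufficient for statistically consistent phylogeny reconstruction for any fixed $k$, so consistency requires more sophisticated use of $k$-mer information, such as block techniques.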
Contribution
It establishes a fundamental impossibility result: as the sequence length tends to infinity, no estimator based solely on $k$-mer counts over the entire leaf sequences can be statistically consistent.
Findings
No consistent phylogeny estimation from $k$-mer counts alone for fixed $k$
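Joint distributions of $k$-mer counts on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity, so the trees cannot be reliably told apart from these counts
Statistical consistency requires more sophisticated use of $k$-mer count information, such as block techniques from previous theoretical work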
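The link between the total variation bound and the failure of consistency can be sketched as follows. The notation is introduced here for illustration and is not taken from the paper: $N^{(n)}$ denotes the vector of $k$-mer counts over all leaf sequences of length $n$, $\mathbb{P}_{T}$ its law when the sequences evolve on tree $T$, and $\hat{T}$ any estimator that depends on the data only through $N^{(n)}$. The impossibility result says that for two distinct trees $T_1 \neq T_2$,

$$
\limsup_{n \to \infty} \, d_{\mathrm{TV}}\!\left(\mathbb{P}_{T_1}, \mathbb{P}_{T_2}\right) < 1 .
$$

By the standard two-point testing bound, the error probabilities of any such estimator satisfy

$$
\mathbb{P}_{T_1}\!\left(\hat{T} \neq T_1\right) + \mathbb{P}_{T_2}\!\left(\hat{T} \neq T_2\right) \;\geq\; 1 - d_{\mathrm{TV}}\!\left(\mathbb{P}_{T_1}, \mathbb{P}_{T_2}\right),
$$

so the two error probabilities cannot both tend to $0$ as $n \to \infty$, which is exactly what statistical consistency would require.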
Abstract
We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of $k$-mer count information, such as block techniques developed in previous theoretical work.
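To make the object of study concrete, here is a minimal Python sketch of computing $k$-mer counts over full two-state leaf sequences. It assumes that $k$-mers are counted over overlapping windows and that the two states are written as `0` and `1`; the paper's exact counting convention may differ. The function name `kmer_counts` and the example sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq: str, k: int) -> dict:
    """Count overlapping k-mers in a two-state (0/1) sequence.

    Assumption: windows overlap and slide one site at a time.
    Every possible k-mer over {0, 1} appears in the result,
    with count 0 if it never occurs.
    """
    if k < 1 or k > len(seq):
        raise ValueError("k must satisfy 1 <= k <= len(seq)")
    if set(seq) - {"0", "1"}:
        raise ValueError("sequence must contain only '0' and '1'")
    observed = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ("".join(p) for p in product("01", repeat=k))
    return {kmer: observed.get(kmer, 0) for kmer in all_kmers}


# Hypothetical leaf sequences; real data would come from an alignment
# or from a simulation of the substitution model on a tree.
leaf_sequences = {
    "leaf_a": "0110100110",
    "leaf_b": "0110110010",
    "leaf_c": "1110100100",
}

# The only information available to a count-based estimator:
# one count vector per leaf, with all site positions discarded.
profiles = {leaf: kmer_counts(s, 2) for leaf, s in leaf_sequences.items()}
```

Note that the count vectors discard where along the sequence each $k$-mer occurs, so the site-by-site correspondence between leaves is lost. Block techniques, as mentioned in the abstract, use $k$-mer information in a more structured way than a single count vector per full sequence.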
Taxonomy
Topics: Fractal and DNA sequence analysis · Algorithms and Data Compression · Genomics and Phylogenetic Studies
