Incorporating indel channels into average-case analysis of seed-chain-extend
Spencer Gibson, Yun William Yu

TL;DR
This paper extends the theoretical analysis of seed-chain-extend algorithms to include insertions and deletions, demonstrating that high recoverability and manageable runtime are achievable under realistic mutation models.
Contribution
It introduces new mathematical tools to analyze seed-chain-extend performance with indels, providing bounds on recoverability and runtime in more realistic mutation scenarios.
Findings
Expected recoverability is at least 1 - O(1/√m).
Expected runtime is O(mn^{3.15·θ_T} log n), where θ_T is the total mutation rate.
Both results hold when θ_T is less than 0.159.
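To get a feel for the runtime bound, the exponent on n can be evaluated directly. The following is a small illustrative sketch, not code from the paper: the function names are hypothetical, the constant 3.15 and the threshold 0.159 are taken from the findings above, and the constants hidden by the O(·) notation are ignored, so the second function only shows how the bound scales.

```python
import math

MAX_TOTAL_RATE = 0.159  # threshold on the total mutation rate quoted in the findings


def runtime_exponent(theta_t: float, coeff: float = 3.15) -> float:
    """Exponent on n in the stated bound O(m * n^(coeff * theta_t) * log n)."""
    if not 0.0 <= theta_t < MAX_TOTAL_RATE + 1e-12:
        raise ValueError("the stated bound assumes a total mutation rate below 0.159")
    return coeff * theta_t


def runtime_scale(m: int, n: int, theta_t: float) -> float:
    """Growth of the bound with hidden constants dropped (illustration only)."""
    return m * n ** runtime_exponent(theta_t) * math.log(n)


if __name__ == "__main__":
    for theta_t in (0.0, 0.05, 0.10, 0.159):
        print(theta_t, round(runtime_exponent(theta_t), 4))
```

At θ_T = 0 the exponent is 0 and the bound reduces to O(m log n). As θ_T approaches 0.159 the exponent approaches roughly 0.5, so the bound grows like m·√n·log n near the threshold.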
Abstract
Given a sequence s_1 of n letters drawn i.i.d. from an alphabet of size σ and a mutated substring s_2 of length m, we often want to recover the mutation history that generated s_2 from s_1. Modern sequence aligners are widely used for this task, and many employ the seed-chain-extend heuristic with k-mer seeds. Previously, Shaw and Yu showed that optimal linear-gap cost chaining can produce a chain with 1 - O(1/√m) recoverability, the proportion of the mutation history that is recovered, in O(mn^{2.43·θ} log n) expected time, where θ < 0.206 is the mutation rate under a substitution-only channel and s_1 is assumed to be uniformly random. However, a gap remains between theory and practice, since real genomic data includes insertions and deletions (indels), and yet seed-chain-extend remains effective. In this…
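The setting described in the abstract can be made concrete with a toy simulation. The sketch below is an illustration under stated assumptions, not the paper's model or algorithm: `mutate` uses a hypothetical per-letter parametrization of an indel channel, `seed_anchors` collects exact k-mer matches, and `chain` is a simple quadratic-time dynamic program whose gap penalty (proportional to the diagonal shift between anchors) is a stand-in for the linear-gap cost analyzed in the paper. All function names and the `gap_cost` parameter are invented for this example.

```python
import random

ALPHABET = "ACGT"


def mutate(seq, p_sub, p_ins, p_del, rng):
    """Toy indel channel (hypothetical parametrization): before each letter,
    maybe insert a random letter; then delete, substitute, or copy the letter."""
    out = []
    for ch in seq:
        if rng.random() < p_ins:
            out.append(rng.choice(ALPHABET))
        u = rng.random()
        if u < p_del:
            continue  # letter deleted
        if u < p_del + p_sub:
            out.append(rng.choice([c for c in ALPHABET if c != ch]))
        else:
            out.append(ch)
    return "".join(out)


def seed_anchors(ref, qry, k):
    """All exact k-mer matches as (ref_pos, qry_pos) pairs, sorted."""
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    anchors = []
    for j in range(len(qry) - k + 1):
        for i in index.get(qry[j:j + k], []):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain(anchors, gap_cost=0.1):
    """Quadratic-time chaining: each anchor scores 1, and linking two anchors
    costs gap_cost times their diagonal shift (a simplified gap penalty)."""
    if not anchors:
        return []
    score = [1.0] * len(anchors)
    prev = [-1] * len(anchors)
    for b, (rb, qb) in enumerate(anchors):
        for a in range(b):
            ra, qa = anchors[a]
            if ra < rb and qa < qb:  # chain must increase in both sequences
                cand = score[a] + 1.0 - gap_cost * abs((rb - ra) - (qb - qa))
                if cand > score[b]:
                    score[b], prev[b] = cand, a
    best = max(range(len(anchors)), key=score.__getitem__)
    path = []
    while best != -1:
        path.append(anchors[best])
        best = prev[best]
    return path[::-1]


if __name__ == "__main__":
    rng = random.Random(0)
    ref = "".join(rng.choice(ALPHABET) for _ in range(300))
    qry = mutate(ref[50:250], p_sub=0.03, p_ins=0.01, p_del=0.01, rng=rng)
    anchors = seed_anchors(ref, qry, k=8)
    print(len(anchors), "anchors,", len(chain(anchors)), "in the best chain")
```

When the query is an unmutated copy of the reference, the best chain consists of every anchor on the main diagonal. With indels, the surviving anchors shift diagonals, which is the effect the paper's analysis has to account for.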
Taxonomy
Topics: Genome Rearrangement Algorithms · DNA and Biological Computing · Genomics and Phylogenetic Studies
