Knowledge distillation for fast and accurate DNA sequence correction
Anastasiya Belyaeva, Joel Shor, Daniel E. Cook, Kishwar Shafin, Daniel Liu, Armin Töpfer, Aaron M. Wenger, William J. Rowell, Howard Yang, Alexey Kolesnikov, Cory Y. McLean, Maria Nattestad, Andrew Carroll, Pi-Chuan Chang

TL;DR
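This paper introduces Distilled DeepConsensus, a distilled transformer-encoder model for DNA sequence correction. It is faster and smaller than its larger, undistilled counterpart and more accurate than the standard HMM-based method for PacBio data, which improves downstream tasks such as variant calling and genome assembly.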
Contribution
The paper presents a distilled transformer-encoder model for DNA sequence correction, designed with runtime constraints in mind. It retains most of the larger model's accuracy gains over HMM-based methods at lower compute cost and model size.
Findings
1.3x faster and 1.5x smaller than its larger, undistilled counterpart.
Improves the yield of high-quality (Q30) reads by 1.69x over the HMM-based method (vs. 1.73x for the larger model).
Reduces variant calling errors by 39% and improves genome assembly quality by 3.8%.
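For readers unfamiliar with the Q30 threshold used above, the following minimal sketch shows the standard Phred quality scale, Q = -10 * log10(p_error), under which Q30 corresponds to an error probability of 1 in 1,000. The per-read error probabilities and the helper names (`phred_quality`, `q30_yield`) are hypothetical and chosen only for illustration; they are not data or code from the paper.

```python
import math


def phred_quality(error_prob: float) -> float:
    """Phred-scaled quality: Q = -10 * log10(p_error)."""
    return -10.0 * math.log10(error_prob)


def q30_yield(read_error_probs, threshold: float = 30.0) -> int:
    """Count reads whose Phred quality is at least `threshold`."""
    return sum(1 for p in read_error_probs if phred_quality(p) >= threshold)


# Hypothetical per-read error probabilities (illustrative only, not from the paper).
baseline_reads = [1e-2, 5e-3, 2e-3, 2e-4]
corrected_reads = [2e-3, 5e-4, 1e-4, 2e-4]

baseline_yield = q30_yield(baseline_reads)
corrected_yield = q30_yield(corrected_reads)
yield_ratio = corrected_yield / baseline_yield
```

A yield improvement such as the one reported in the findings is a ratio of this kind: the number of reads at or above Q30 after correction divided by the number from the baseline method.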
Abstract
Accurate genome sequencing can improve our understanding of biology and the genetic basis of disease. The standard approach for generating DNA sequences from PacBio instruments relies on HMM-based models. Here, we introduce Distilled DeepConsensus - a distilled transformer-encoder model for sequence correction, which improves upon the HMM-based methods with runtime constraints in mind. Distilled DeepConsensus is 1.3x faster and 1.5x smaller than its larger counterpart while improving the yield of high quality reads (Q30) over the HMM-based method by 1.69x (vs. 1.73x for larger model). With improved accuracy of genomic sequences, Distilled DeepConsensus improves downstream applications of genomic sequence analysis such as reducing variant calling errors by 39% (34% for larger model) and improving genome assembly quality by 3.8% (4.2% for larger model). We show that the representations…
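The abstract does not spell out the distillation objective. As a rough illustration of the general idea only, the sketch below uses a common soft-target formulation of knowledge distillation: a weighted sum of the KL divergence between temperature-softened teacher and student distributions and the ordinary cross-entropy against the true label. The five-class output (A, C, G, T, gap), the temperature, the weight `alpha`, and all function names are assumptions made for this example and may differ from what the authors used.

```python
import math


def softmax(logits, temperature: float = 1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_index,
                      temperature: float = 2.0, alpha: float = 0.5) -> float:
    """Generic soft-target distillation loss for one position (assumed form).

    alpha weights the soft (teacher-matching) term; (1 - alpha) weights the
    hard cross-entropy term against the true label.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the temperature-softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Standard cross-entropy of the student against the true class.
    hard = -math.log(softmax(student_logits)[true_index])
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * kl + (1.0 - alpha) * hard


# Hypothetical logits over (A, C, G, T, gap) for a single position.
teacher = [4.0, 1.0, 0.5, 0.2, 0.1]
student = [1.0, 2.0, 0.5, 0.2, 0.1]

loss_mismatch = distillation_loss(student, teacher, true_index=0)
loss_match = distillation_loss(teacher, teacher, true_index=0)
```

When the student reproduces the teacher's logits, the KL term vanishes and only the hard-label term remains, so `loss_match` is smaller than `loss_mismatch`.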
Taxonomy
Topics: Genomics and Phylogenetic Studies · Algorithms and Data Compression · Machine Learning in Bioinformatics
