GapPredict: A Language Model for Resolving Gaps in Draft Genome Assemblies
Eric Chen, Justin Chu, Jessica Zhang, Rene L. Warren, Inanc Birol

TL;DR
GapPredict is a deep learning-based tool that predicts unresolved nucleotides in the gaps of genome assembly scaffolds. In the authors' benchmark it filled 65.6% of the sampled gaps that the gap-filling tool Sealer left unfilled.
Contribution
This paper introduces GapPredict, a character-level language model for resolving gaps in genome assemblies, and benchmarks it against the state-of-the-art gap-filling tool Sealer.
Findings
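GapPredict fills 65.6% of the sampled gaps that Sealer left unfilled.
The result demonstrates the practical utility of deep learning approaches for gap-filling in genome assemblies.
The benchmark compares GapPredict against Sealer, the state-of-the-art gap-filling tool.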
Abstract
Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using short reads due to both repetitive and difficult-to-sequence regions in these genomes. Some of the short-read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as "gaps". Here, we introduce GapPredict, a tool that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer, and observed that the former can fill 65.6% of the sampled gaps that were left unfilled by the latter, demonstrating the practical utility of deep learning approaches to the…
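To make the idea of character-level prediction concrete, the sketch below extends a gap's left flank one base at a time. It is not GapPredict's implementation: a simple count-based k-mer context model stands in for the neural language model, and the function names, the context length `k`, and the toy reads are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_char_model(reads, k):
    """Count which base follows each k-base context across all reads.

    Hypothetical stand-in for a trained character-level language model.
    """
    counts = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            counts[read[i:i + k]][read[i + k]] += 1
    return counts


def predict_gap(left_flank, counts, k, max_len):
    """Greedily predict up to max_len bases following the left flank."""
    seq = left_flank
    predicted = []
    for _ in range(max_len):
        context = seq[-k:]
        if context not in counts:
            # Unseen context: stop rather than guess.
            break
        # Pick the base most often observed after this context.
        base = counts[context].most_common(1)[0][0]
        predicted.append(base)
        seq += base
    return "".join(predicted)


# Toy, made-up reads that overlap the region to the right of the flank.
reads = ["ACGTTGCA", "GTTGCATC", "TGCATCGG"]
model = train_char_model(reads, k=4)
fill = predict_gap("ACGT", model, k=4, max_len=6)
print(fill)
```

The sketch only extends from one flank and always takes the most frequent next base. A real gap filler would also need to decide when the prediction has reached the far side of the gap, which this example does not attempt.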
Taxonomy
Topics: Genomics and Phylogenetic Studies · RNA and protein synthesis mechanisms · Genomics and Chromatin Dynamics
