ATHENA: Automated Tuning of Genomic Error Correction Algorithms using Language Models
Mustafa Abdallah, Ashraf Mahgoub, Saurabh Bagchi, Somali Chaterji

TL;DR
This paper introduces ATHENA, an NLP-inspired method that automatically tunes error correction parameters for genomic sequencing data using language models and perplexity metrics, improving correction without reference genomes.
Contribution
ATHENA applies language modeling techniques to optimize error correction parameters in genomics, enabling data-driven tuning adaptable to different datasets without requiring reference genomes.
Findings
Perplexity correlates negatively with error correction quality, so lower perplexity indicates better correction.
Language models effectively guide parameter tuning in genomic error correction.
Method works for both de novo and resequencing datasets.
Abstract
The performance of most error-correction algorithms that operate on genomic sequencer reads depends on the proper choice of their configuration parameters, such as the value of k in k-mer based techniques. In this work, we target the problem of finding the best values of these configuration parameters to optimize error correction. We perform this in a data-driven manner, based on the observation that different configuration parameters are optimal for different datasets, i.e., from different instruments and organisms. We use language modeling techniques from the Natural Language Processing (NLP) domain in our algorithmic suite, Athena, to automatically tune the performance-sensitive configuration parameters. Through the use of N-Gram and Recurrent Neural Network (RNN) language modeling, we validate the intuition that the error correction (EC) performance can be computed quantitatively and efficiently…
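The idea of scoring candidate parameter values with a language model can be illustrated with a minimal sketch. This is not Athena's implementation: it assumes a character-level N-gram model with add-one smoothing over the alphabet A, C, G, T, and the names `train_ngram`, `perplexity`, `pick_parameter`, and the `correct` callable (a stand-in for running an error-correction tool with a given parameter value) are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def train_ngram(reads, n=3):
    """Count n-grams and their (n-1)-length contexts over a list of reads."""
    ngrams, contexts = Counter(), Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts


def perplexity(reads, model, n=3):
    """Perplexity of reads under an add-one-smoothed n-gram model."""
    ngrams, contexts = model
    log_prob, count = 0.0, 0
    for read in reads:
        for i in range(len(read) - n + 1):
            gram = read[i:i + n]
            # Add-one smoothing keeps unseen n-grams at a nonzero probability.
            p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + len(ALPHABET))
            log_prob += math.log2(p)
            count += 1
    if count == 0:
        raise ValueError("reads are too short to contain any n-gram")
    return 2 ** (-log_prob / count)


def pick_parameter(candidates, correct, held_out, model, n=3):
    """Return the candidate whose corrected output has the lowest perplexity.

    `correct(reads, value)` is a hypothetical wrapper around an
    error-correction tool; it must return the corrected reads.
    """
    scores = {value: perplexity(correct(held_out, value), model, n)
              for value in candidates}
    return min(scores, key=scores.get), scores
```

Under these assumptions, a caller would train the model on a sample of reads, run the correction tool once per candidate value on a held-out sample, and keep the value with the lowest perplexity. No reference genome enters the calculation at any point.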
Taxonomy
Topics: Genomics and Phylogenetic Studies · Machine Learning in Bioinformatics · Natural Language Processing Techniques
