DNA Sequence Classification with Compressors
Şükrü Ozan

TL;DR
This paper introduces a resource-efficient, compressor-based DNA sequence classification method that maintains high accuracy while reducing computational demands, advancing scalable genomic analysis.
Contribution
It adapts a parameter-free compression approach for DNA classification using multiple algorithms, improving efficiency over traditional machine learning methods.
Findings
Effective classification across multiple species
Comparable accuracy to state-of-the-art methods
Enhanced resource efficiency and scalability
Abstract
Recent studies in DNA sequence classification have leveraged sophisticated machine learning techniques, achieving notable accuracy in categorizing complex genomic data. Among these, methods such as k-mer counting have proven effective in distinguishing sequences from varied species like chimpanzees, dogs, and humans, becoming a staple in contemporary genomic research. However, these approaches often demand extensive computational resources, posing a challenge in terms of scalability and efficiency. Addressing this issue, our study introduces a novel adaptation of Jiang et al.'s compressor-based, parameter-free classification method, specifically tailored for DNA sequence analysis. This innovative approach utilizes a variety of compression algorithms, such as Gzip, Brotli, and LZMA, to efficiently process and classify genomic sequences. Not only does this method align with the current…
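The compressor-based idea the abstract refers to can be illustrated with a small sketch. It assumes the usual formulation of Jiang et al.'s method: a normalized compression distance (NCD), NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length, followed by a k-nearest-neighbour vote. The function names, the direct concatenation of the two sequences, the value of k, and the synthetic data are all illustrative assumptions, not the paper's actual pipeline or datasets. Only the standard-library gzip and lzma compressors are used here; a Brotli compressor (for example from the third-party brotli package) could presumably be passed in the same way.

```python
import gzip
import lzma
import random
from collections import Counter


def compressed_len(seq, compress):
    """Length in bytes of the compressed sequence."""
    return len(compress(seq.encode("ascii")))


def ncd(x, y, compress=gzip.compress):
    """Normalized compression distance between two sequences."""
    cx = compressed_len(x, compress)
    cy = compressed_len(y, compress)
    cxy = compressed_len(x + y, compress)  # plain concatenation (assumption)
    return (cxy - min(cx, cy)) / max(cx, cy)


def knn_classify(query, train, k=3, compress=gzip.compress):
    """Majority label among the k training sequences closest to query."""
    dists = sorted((ncd(query, seq, compress), label) for seq, label in train)
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Synthetic demo data (NOT from the paper): two random "ancestor" sequences,
# with noisy copies of each standing in for two classes.
def random_dna(rng, n):
    return "".join(rng.choice("ACGT") for _ in range(n))


def mutate(rng, seq, rate):
    """Replace each base with a random base with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else b for b in seq)


rng = random.Random(0)
anc_a, anc_b = random_dna(rng, 2000), random_dna(rng, 2000)
train = [(mutate(rng, anc_a, 0.02), "A") for _ in range(3)]
train += [(mutate(rng, anc_b, 0.02), "B") for _ in range(3)]
query = mutate(rng, anc_a, 0.02)

for name, compress in [("gzip", gzip.compress), ("lzma", lzma.compress)]:
    print(name, knn_classify(query, train, k=3, compress=compress))
```

No model is trained in this sketch: the only "parameters" are the choice of compressor and k, which is what makes the approach parameter-free in the sense the abstract describes. The cost is one compression per query/training pair, so the work grows with the size of the training set.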
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Algorithms and Data Compression
Methods: ALIGN
