NASCUP: Nucleic Acid Sequence Classification by Universal Probability
Sunyoung Kwon, Gyuwan Kim, Byunghan Lee, Jongsik Chun, Sungroh Yoon, and Young-Han Kim

TL;DR
NASCUP is a fast nucleic acid sequence classification method built on universal probability and compact context-tree models. It reaches BLAST-like accuracy on large-scale databases with runtime reduced by orders of magnitude.
Contribution
The paper introduces NASCUP, a classification approach that captures the statistical structure of nucleotide sequences with context-tree models and universal probability, enabling efficient large-scale sequence analysis.
Findings
Achieved BLAST-like classification accuracy consistently on several large-scale databases
Reduced runtime by orders of magnitude at that accuracy level
Applicable to outlier detection and synthetic sequence generation
Abstract
Motivated by the need for fast and accurate classification of unlabeled nucleotide sequences on a large scale, we developed NASCUP, a new classification method that captures statistical structures of nucleotide sequences by compact context-tree models and universal probability from information theory. NASCUP achieved BLAST-like classification accuracy consistently for several large-scale databases in orders-of-magnitude reduced runtime, and was applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.
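To make the idea of classifying by universal probability concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a fixed-depth context model with Krichevsky-Trofimov (add-1/2) estimates as a simplified stand-in for the paper's context-tree models, and it assigns a query to the class whose training data gives the query the highest sequential conditional probability. The names `KTContextModel` and `classify`, and the `depth` parameter, are hypothetical, and the sketch assumes sequences contain only A, C, G, and T.

```python
import math

ALPHABET = "ACGT"  # assumption: no ambiguity codes such as N


class KTContextModel:
    """Fixed-depth context model with Krichevsky-Trofimov (add-1/2) estimates.

    Simplified stand-in for a context-tree model: every context has the
    same length `depth` instead of being weighted over a tree.
    """

    def __init__(self, depth=2):
        self.depth = depth
        self.counts = {}  # context string -> {symbol: count}

    def _pairs(self, seq):
        # Yield (context, next symbol) for every position with a full context.
        for i in range(self.depth, len(seq)):
            yield seq[i - self.depth:i], seq[i]

    def train(self, seq):
        for ctx, sym in self._pairs(seq):
            table = self.counts.setdefault(ctx, {})
            table[sym] = table.get(sym, 0) + 1

    def conditional_log2_prob(self, seq):
        """Return log2 P(seq | training data) under sequential KT updates.

        Counts from `seq` itself are kept in a scratch table, so the
        trained model is left unchanged.
        """
        scratch = {}
        total = 0.0
        for ctx, sym in self._pairs(seq):
            base = self.counts.get(ctx, {})
            seen = scratch.setdefault(ctx, {})
            n_sym = base.get(sym, 0) + seen.get(sym, 0)
            n_ctx = sum(base.values()) + sum(seen.values())
            total += math.log2((n_sym + 0.5) / (n_ctx + 0.5 * len(ALPHABET)))
            seen[sym] = seen.get(sym, 0) + 1
        return total


def classify(query, models):
    """Pick the class label whose model assigns the query the highest probability."""
    return max(models, key=lambda label: models[label].conditional_log2_prob(query))


# Toy usage with two made-up classes.
models = {"at_rich": KTContextModel(depth=2), "gc_rich": KTContextModel(depth=2)}
models["at_rich"].train("ATATATATATATATATATAT")
models["gc_rich"].train("GCGCGCGCGCGCGCGCGCGC")
predicted = classify("ATATATAT", models)
```

In this sketch each class keeps only a table of context counts, so scoring a query costs time proportional to its length per class. This illustrates, under the stated assumptions, why a model-based classifier can avoid the alignment search that BLAST performs.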
Taxonomy
Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications
