NASCUP: Nucleic Acid Sequence Classification by Universal Probability

Sunyoung Kwon; Gyuwan Kim; Byunghan Lee; Jongsik Chun; Sungroh Yoon; and Young-Han Kim

arXiv:1511.04944·q-bio.GN·November 30, 2018

TL;DR

NASCUP is a fast nucleic acid sequence classification method built on universal probability and compact context-tree models. It reaches BLAST-like classification accuracy on several large-scale databases in orders-of-magnitude less runtime.

Contribution

The paper introduces NASCUP, a classification method that captures the statistical structure of nucleotide sequences with compact context-tree models and universal probability from information theory, making large-scale classification of unlabeled sequences efficient.

Findings

01 Achieved BLAST-like accuracy on large databases

02 Reduced runtime by orders of magnitude

03 Applicable to outlier detection and synthetic sequence generation

Abstract

Motivated by the need for fast and accurate classification of unlabeled nucleotide sequences on a large scale, we developed NASCUP, a new classification method that captures statistical structures of nucleotide sequences by compact context-tree models and universal probability from information theory. NASCUP achieved BLAST-like classification accuracy consistently for several large-scale databases in orders-of-magnitude reduced runtime, and was applied to other bioinformatics tasks such as outlier detection and synthetic sequence generation.
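To make the idea of classifying by sequence probability concrete, here is a minimal sketch. It is not the authors' implementation: in place of the paper's context-tree models it assumes a simpler fixed-order context model with a Krichevsky-Trofimov (KT) estimator, and the names `KTContextModel`, `classify`, `depth`, and the toy class labels are all hypothetical. Each class gets its own model, and a query sequence is assigned to the class whose model gives it the highest probability.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class KTContextModel:
    """Fixed-order context model with a KT estimator (a simplification,
    assumed here as a stand-in for the paper's context-tree models)."""

    def __init__(self, depth=2):
        self.depth = depth
        # counts[context][symbol] = times `symbol` followed `context` in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, seq):
        """Accumulate next-symbol counts for every context of length `depth`."""
        for i in range(self.depth, len(seq)):
            self.counts[seq[i - self.depth:i]][seq[i]] += 1

    def log_prob(self, seq):
        """Log2 probability of `seq`, skipping its first `depth` symbols."""
        total = 0.0
        for i in range(self.depth, len(seq)):
            ctx = self.counts.get(seq[i - self.depth:i], {})
            n = sum(ctx.values())
            # KT estimate for a 4-letter alphabet: (n_a + 1/2) / (n + 4/2)
            p = (ctx.get(seq[i], 0) + 0.5) / (n + len(ALPHABET) / 2)
            total += math.log2(p)
        return total


def classify(seq, models):
    """Return the label whose model assigns `seq` the highest probability."""
    return max(models, key=lambda label: models[label].log_prob(seq))


# Toy training data (made up for illustration only).
training = {
    "at_rich": ["ATATATATATATATAT", "AATTAATTAATTAATT"],
    "gc_rich": ["GCGCGCGCGCGCGCGC", "GGCCGGCCGGCCGGCC"],
}
models = {}
for label, seqs in training.items():
    models[label] = KTContextModel(depth=2)
    for s in seqs:
        models[label].train(s)

print(classify("ATATATTATA", models))
print(classify("GCGCGGCC", models))
```

Scoring a query only needs one pass over it per class model, with no alignment against database sequences, which gives some intuition for why a probability-based classifier can be much faster than alignment-based search.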

Code & Models

Repositories

nascup/nascup

Taxonomy

Topics: Machine Learning in Bioinformatics · Genomics and Phylogenetic Studies · Advanced Proteomics Techniques and Applications