A Note on the Compaction of long Training Sequences for Universal Classification - a Non-Probabilistic Approach

Jacob Ziv

arXiv:1102.5482 · cs.IT · June 24, 2014 · 1 cites

Open Access

TL;DR

This paper proposes a non-probabilistic method for compacting long training sequences into a suffix tree with O(N) leaves, reducing storage needs at the cost of only a minimal increase in the classification error rate.

Contribution

It introduces a universal data compaction technique for feature-based classifiers that does not rely on probabilistic models, applicable to biological sequence classification.

Findings

01. Suffix tree with O(N) leaves for long training sequences

02. Minimal increase in classification error rate

03. Applicable to biological data without probabilistic assumptions

Abstract

One of the central problems in the classification of individual test sequences (e.g. in genetic analysis) is that of checking the similarity of sample test sequences against a set of much longer training sequences. This is done by a set of classifiers for test sequences of length N, where each of the classifiers is trained by the training sequences so as to minimize the classification error rate when fed with each of the training sequences. It should be noted that the storage of long training sequences is considered to be a serious bottleneck in next-generation sequencing for genome analysis. Some popular classification algorithms adopt a probabilistic approach, assuming that the sequences are realizations of some variable-length Markov process or a hidden Markov process (HMM), thus enabling the embedding of the training data onto a variable-length suffix tree, the…
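
The idea of embedding training data onto a tree of substrings, mentioned in the abstract, can be illustrated with a toy example. The sketch below is not the paper's construction. It assumes a fixed window length `depth` (a hypothetical parameter), stores every window of the training sequence in a nested-dictionary trie, and scores a test sequence by the fraction of its windows that also occur in the training sequence. The function names `build_trie`, `count_leaves`, and `match_fraction` are made up for illustration.

```python
from typing import Dict


def build_trie(training: str, depth: int) -> Dict:
    """Insert every length-`depth` window of `training` into a nested-dict trie.

    Assumption for illustration only: a fixed window length, not the
    variable-length tree discussed in the abstract.
    """
    root: Dict = {}
    for i in range(len(training) - depth + 1):
        node = root
        for ch in training[i:i + depth]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node: Dict) -> int:
    """Count leaves, i.e. distinct stored windows (an empty trie counts as 1)."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(root: Dict, word: str) -> bool:
    """Return True if `word` is a path starting at the root of the trie."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return True


def match_fraction(root: Dict, test: str, depth: int) -> float:
    """Fraction of length-`depth` windows of `test` that occur in the trie."""
    windows = [test[i:i + depth] for i in range(len(test) - depth + 1)]
    if not windows:
        return 0.0
    return sum(contains(root, w) for w in windows) / len(windows)
```

For the training sequence `"ACGTACGTAC"` with `depth=3`, the eight windows collapse into four distinct leaves (`ACG`, `CGT`, `GTA`, `TAC`). Repeated structure in a long training sequence is what makes a compact tree possible in this toy setting.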

Taxonomy

Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Machine Learning in Bioinformatics