A Note on the Compaction of long Training Sequences for Universal Classification -a Non-Probabilistic Approach
Jacob Ziv

TL;DR
This paper proposes a non-probabilistic method for compacting long training sequences onto a suffix tree with O(N) leaves, where N is the test-sequence length, reducing storage needs with only a minimal increase in classification error rate.
Contribution
It introduces a universal data compaction technique for feature-based classifiers that does not rely on probabilistic models, applicable to biological sequence classification.
Findings
Suffix-tree with O(N) leaves for long training sequences
Minimal increase in classification error rate
Applicable to biological data without probabilistic assumptions
Abstract
One of the central problems in the classification of individual test sequences (e.g. genetic analysis) is that of checking for the similarity of sample test sequences as compared with a set of much longer training sequences. This is done by a set of classifiers for test sequences of length N, where each of the classifiers is trained by the training sequences so as to minimize the classification error rate when fed with each of the training sequences. It should be noted that the storage of long training sequences is considered to be a serious bottleneck in next-generation sequencing for genome analysis. Some popular classification algorithms adopt a probabilistic approach, by assuming that the sequences are realizations of some variable-length Markov process or a hidden Markov process (HMM), thus enabling the embedding of the training data onto a variable-length suffix tree, the…
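To make the idea of embedding training data onto a suffix tree concrete, here is a minimal Python sketch. It is not the paper's compaction method: the depth bound `max_depth`, the window length `k`, the window-match score, and all function names are assumptions chosen for illustration only.

```python
def build_suffix_trie(training, max_depth):
    """Insert, for every start position, the substring of at most
    max_depth symbols beginning there (a depth-bounded suffix trie)."""
    root = {}
    for start in range(len(training)):
        node = root
        for symbol in training[start:start + max_depth]:
            node = node.setdefault(symbol, {})
    return root


def count_leaves(node):
    """Count nodes with no children."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


def contains(root, word):
    """Return True if `word` labels a path starting at the root."""
    node = root
    for symbol in word:
        if symbol not in node:
            return False
        node = node[symbol]
    return True


def similarity(root, test, k):
    """Hypothetical score: fraction of length-k windows of `test`
    that also occur in the training sequence stored in the trie."""
    windows = [test[i:i + k] for i in range(len(test) - k + 1)]
    if not windows:
        return 0.0
    return sum(contains(root, w) for w in windows) / len(windows)


# Toy usage with an invented training sequence.
trie = build_suffix_trie("ACGTACGTTACG", max_depth=3)
score_similar = similarity(trie, "ACGT", k=3)
score_dissimilar = similarity(trie, "GGGG", k=3)
```

In this sketch the trie never has more leaves than the training sequence has start positions, and bounding the depth keeps the stored structure small relative to the raw training data. How the paper itself bounds the tree at O(N) leaves is not shown here.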
Taxonomy
Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Machine Learning in Bioinformatics
