On an Application of Relative Entropy
Dmitry V. Khmelev, William J. Teahan

TL;DR
This paper presents a method for classifying character sequences like texts and DNA using relative entropy estimated through compression and Markov Chains, demonstrating its effectiveness and comparing it to previous approaches.
Contribution
The paper introduces a simple, computationally efficient approach using first-order Markov Chains for estimating relative entropy in sequence classification tasks.
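As a rough illustration of the first-order Markov Chain idea, the sketch below trains a character-pair model on each labelled text and assigns an unknown sequence to the label whose model gives it the lowest cross-entropy. Ranking by cross-entropy is equivalent to ranking by relative entropy here, since the two differ only by the entropy of the unknown sequence, which is the same for every candidate. The function names, the additive smoothing constant `alpha`, and the toy data are assumptions of this sketch, not details taken from the paper.

```python
import math
from collections import Counter


def train_bigram_model(text):
    """Count character transitions (a -> b) and how often each a starts one."""
    pair_counts = Counter(zip(text, text[1:]))
    first_counts = Counter(text[:-1])
    return pair_counts, first_counts


def cross_entropy(model, text, alphabet_size, alpha=1.0):
    """Average -log2 P(b | a) over the transitions of `text` under `model`.

    Additive smoothing with `alpha` is an assumption of this sketch; it keeps
    unseen transitions from receiving zero probability.
    """
    pair_counts, first_counts = model
    total_bits = 0.0
    transitions = 0
    for a, b in zip(text, text[1:]):
        p = (pair_counts[(a, b)] + alpha) / (first_counts[a] + alpha * alphabet_size)
        total_bits -= math.log2(p)
        transitions += 1
    return total_bits / transitions if transitions else 0.0


def classify(unknown, labelled_texts):
    """Return the label with the lowest cross-entropy, plus all scores."""
    alphabet_size = len(set(unknown).union(*labelled_texts.values()))
    scores = {
        label: cross_entropy(train_bigram_model(text), unknown, alphabet_size)
        for label, text in labelled_texts.items()
    }
    return min(scores, key=scores.get), scores
```

Training is a single pass over each labelled text and scoring is a single pass over the unknown sequence, which is what makes this kind of model cheap compared with running a compressor on every pair of texts.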
Findings
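Relative entropy estimated by off-the-shelf compression or by Markov Chains is precise enough for classifying character sequences
The first-order Markov Chain approach easily surpasses the relative entropy estimate described by D. Benedetto et al. (cond-mat/0108530)
The first-order Markov Chain approach is simple and computationally efficient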
Abstract
We describe a general approach to the classification of character sequences (texts, DNA) using relative entropy estimated by off-the-shelf compression and by Markov Chains, and find both estimates precise enough. We also note that the method for estimating relative entropy described in the paper cond-mat/0108530 "Language Trees..." by D. Benedetto et al. was considered earlier and was found to be easily surpassed by the simple and computationally efficient first-order Markov Chain approach.
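The compression-based estimate mentioned in the abstract can be sketched in a similar spirit: compress a reference text alone, compress the reference with the unknown sequence appended, and treat the growth in compressed size as a proxy for how poorly the reference predicts the unknown sequence. The use of `zlib` as the off-the-shelf compressor and the helper names below are assumptions of this sketch; the abstract does not say which compressor the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def compression_score(reference: str, unknown: str) -> float:
    """Extra compressed bits per byte of `unknown` when appended to `reference`.

    Lower values suggest the reference is a better model of the unknown text.
    This is a crude estimate; zlib only looks back over a 32 KiB window, so
    long references contribute only their tail.
    """
    ref = reference.encode("utf-8")
    unk = unknown.encode("utf-8")
    extra_bytes = compressed_size(ref + unk) - compressed_size(ref)
    return 8.0 * extra_bytes / len(unk)


def classify_by_compression(unknown: str, labelled_texts: dict) -> str:
    """Pick the label whose reference text yields the smallest score."""
    return min(labelled_texts, key=lambda label: compression_score(labelled_texts[label], unknown))
```

Each score requires compressing the full reference again, so the cost grows with the number and size of the reference texts.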
Taxonomy
Topics: Machine Learning in Bioinformatics · Fractal and DNA sequence analysis · Authorship Attribution and Profiling
