Dictionary based methods for information extraction
A. Baronchelli, E. Caglioti, V. Loreto, E. Pizzi

TL;DR
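This paper presents a general information extraction method built on the dictionaries produced by data compression techniques. It reports very good results for sequence comparison and results on self-consistent classification, and argues that studying dictionaries helps investigate the sequences they were extracted from, such as DNA strings.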
Contribution
It defines the "dictionary" of a sequence as an object of study in its own right and describes a procedure for comparing strings through dictionary-created sequences ("artificial texts"), extending the use of data compression techniques in data analysis.
Findings
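String comparison between dictionary-created sequences ("artificial texts") gives very good results in several contexts
Results are reported on self-consistent classification problems
Dictionary features can help investigate the source sequences, e.g. DNA strings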
Abstract
In this paper we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called "dictionary" of a sequence. Dictionaries are intrinsically interesting and a study of their features can be of great usefulness to investigate the properties of the sequences they have been extracted from (e.g. DNA strings). We then describe a procedure of string comparison between dictionary-created sequences (or "artificial texts") that gives very good results in several contexts. We finally present some results on self-consistent classification problems.
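To make the two objects named in the abstract concrete, here is a minimal sketch of extracting a dictionary from a sequence and building an "artificial text" from it. The abstract does not say which compressor or sampling rule is used, so everything below is an assumption: the dictionary is taken to be the set of repeated substrings found by a greedy LZ77-style longest-match parse, and the artificial text is a uniform random concatenation of dictionary words. The names `lz77_dictionary`, `artificial_text`, `min_len`, and `seed` are hypothetical.

```python
import random
from collections import Counter


def lz77_dictionary(text, min_len=2):
    """Collect repeated substrings from a greedy LZ77-style parse.

    Assumption: a "word" is the longest prefix of text[i:] that also
    occurs starting at some earlier position (overlap allowed).
    """
    words = Counter()
    i = 0
    while i < len(text):
        length = 0
        # Grow the match while text[i:i+length+1] also starts before i.
        while (i + length < len(text)
               and text.find(text[i:i + length + 1], 0, i + length) != -1):
            length += 1
        if length >= min_len:
            words[text[i:i + length]] += 1
        # Simplification: skip past the match (or one symbol if none).
        i += max(length, 1)
    return words


def artificial_text(words, n_words, seed=0):
    """Concatenate n_words dictionary words drawn uniformly at random."""
    rng = random.Random(seed)
    vocab = sorted(words)  # sorted so a fixed seed gives a fixed text
    return "".join(rng.choice(vocab) for _ in range(n_words))


if __name__ == "__main__":
    dictionary = lz77_dictionary("ACGTACGTTTACGACGT")
    print(dictionary)
    print(artificial_text(dictionary, 5))
```

The quadratic-time search is chosen for readability only; a real compressor would use a sliding window with hashing or a suffix structure. Two artificial texts built this way could then be passed to whatever string comparison procedure one prefers.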
Taxonomy
Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Semigroups and automata theory
