Dictionary based methods for information extraction

A. Baronchelli; E. Caglioti; V. Loreto; E. Pizzi

arXiv:cond-mat/0402581·cond-mat.stat-mech·November 10, 2009

TL;DR

This paper presents a general information extraction method built on the "dictionary" that a data compression technique associates with a sequence. Dictionaries are studied in their own right (e.g. for DNA strings) and used to build "artificial texts" for string comparison and self-consistent classification.

Contribution

It presents a new dictionary-based approach for information extraction and sequence comparison, expanding the application of data compression techniques in data analysis.

Findings

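01 String comparison between dictionary-created sequences ("artificial texts") gives very good results in several contexts

02 Results are reported on self-consistent classification problems

03 Dictionary features can help investigate the sequences they were extracted from, e.g. DNA strings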

Abstract

In this paper we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called "dictionary" of a sequence. Dictionaries are intrinsically interesting and a study of their features can be of great usefulness to investigate the properties of the sequences they have been extracted from (e.g. DNA strings). We then describe a procedure of string comparison between dictionary-created sequences (or "artificial texts") that gives very good results in several contexts. We finally present some results on self-consistent classification problems.
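As a rough illustration of what a "dictionary" of a sequence can look like, the sketch below collects the phrases produced by a plain LZ78 parse and compares two sequences by the overlap of their phrase sets. This is an assumption made for illustration only: the abstract does not say which compression scheme or comparison measure the authors use, and the function names `lz78_dictionary` and `dictionary_overlap` are hypothetical.

```python
def lz78_dictionary(sequence: str) -> set:
    """Return the set of phrases from a plain LZ78 parse of `sequence`.

    Assumption: LZ78 is a stand-in here; the paper's own dictionary
    definition may differ.
    """
    phrases = set()
    current = ""
    for symbol in sequence:
        current += symbol
        # A phrase ends as soon as it is not yet in the dictionary.
        if current not in phrases:
            phrases.add(current)
            current = ""
    # A trailing fragment that is already a known phrase adds nothing new.
    return phrases


def dictionary_overlap(a: str, b: str) -> float:
    """Jaccard overlap of two LZ78 dictionaries (hypothetical similarity score)."""
    dict_a, dict_b = lz78_dictionary(a), lz78_dictionary(b)
    if not dict_a and not dict_b:
        return 1.0  # two empty sequences are treated as identical
    return len(dict_a & dict_b) / len(dict_a | dict_b)


if __name__ == "__main__":
    print(sorted(lz78_dictionary("abababab")))
    print(dictionary_overlap("acgtacgtacgt", "acgtacgaacgt"))
```

A set-overlap score is only the simplest possible comparison; the paper's procedure compares artificial texts built from the dictionaries, which this sketch does not attempt to reproduce.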

Taxonomy

Topics: Algorithms and Data Compression · Fractal and DNA sequence analysis · Semigroups and automata theory