Artificial Sequences and Complexity Measures

Andrea Baronchelli; Emanuele Caglioti; Vittorio Loreto

arXiv:cond-mat/0403233·cond-mat.stat-mech·November 10, 2009

TL;DR

This paper introduces a novel, compression-based information measure for character sequences, enabling automatic, language-independent analysis for tasks like language recognition and authorship attribution.

Contribution

It presents a new class of methods using data compression to quantify sequence similarity and extract information, applicable across diverse data types.

Findings

01 Effective in language recognition tasks

02 Accurate authorship attribution results

03 Versatile across different types of character data

Abstract

In this paper we exploit concepts of information theory to address the fundamental problem of identifying and defining the most suitable tools to extract, in an automatic and agnostic way, information from a generic string of characters. We introduce in particular a class of methods which use in a crucial way data compression techniques in order to define a measure of remoteness and distance between pairs of sequences of characters (e.g. texts) based on their relative information content. We also discuss in detail how specific features of data compression techniques could be used to introduce the notion of dictionary of a given sequence and of Artificial Text and we show how these new tools can be used for information extraction purposes. We point out the versatility and generality of our method that applies to any kind of corpora of character strings independently of the type of coding…
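
To make the idea of a compression-based remoteness measure concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the authors' exact procedure: it uses Python's `zlib` as a stand-in compressor, and the function names `compressed_size` and `remoteness` are hypothetical. The sketch compresses a reference sequence alone, then with a short probe appended, and reports the extra compressed bytes per probe byte. A probe that resembles the reference should cost fewer extra bytes than one that does not.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (stand-in compressor)."""
    return len(zlib.compress(data, 9))


def remoteness(reference: bytes, probe: bytes) -> float:
    """Extra compressed bytes per probe byte when probe follows reference.

    Hypothetical helper: a rough proxy for how poorly the statistics
    learned on `reference` describe `probe`. Lower means more similar.
    """
    if not probe:
        raise ValueError("probe must be non-empty")
    extra = compressed_size(reference + probe) - compressed_size(reference)
    return extra / len(probe)


if __name__ == "__main__":
    reference = b"the quick brown fox jumps over the lazy dog " * 50
    similar = b"the quick brown fox jumps over the lazy dog " * 2
    different = bytes(range(88))  # 88 distinct byte values, same length as `similar`

    print("similar probe:  ", remoteness(reference, similar))
    print("different probe:", remoteness(reference, different))
```

The probe should stay short relative to the reference so that the compressor mostly reuses what it learned from the reference; note also that `zlib` only looks back over a 32 KiB window, so very long references are only partly exploited.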
