Text Classification with Compression Algorithms

Antonio Giuliano Zippo

arXiv:1210.7657 · cs.LG · October 30, 2012 · 1 citation

TL;DR

This paper introduces a text classification method using compression-based kernels that are language-independent and can capture long-range dependencies, showing competitive accuracy on standard datasets.

Contribution

It proposes a novel kernel function based on compression algorithms for text similarity, offering an alternative to traditional feature-based methods.

Findings

01

Compression kernels outperform Gaussian, linear, and polynomial kernels on certain datasets.

02

The method is language independent and requires no text preprocessing.

03

Computing the Gram matrix is computationally expensive, and performance on non-textual datasets is very poor.

Abstract

This work compares SVM kernel methods on text categorization tasks. In particular, I define a kernel function that estimates the similarity between two objects from their compressed lengths. Compression algorithms can detect arbitrarily long dependencies within text strings, whereas vectorizing text loses information during feature extraction and is highly sensitive to the language of the text. Furthermore, compression-based methods are language independent and require no text preprocessing. Moreover, the accuracy measured on the datasets (Web-KB, 20ng and Reuters-21578) is in some cases greater than that of Gaussian, linear and polynomial kernels. The method's limitations are the computational time complexity of the Gram matrix and very poor performance on non-textual datasets.
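To make the idea of a similarity computed from compressed lengths concrete, here is a minimal sketch. The abstract does not state the exact formula or compressor, so everything specific here is an assumption: the sketch uses the well-known Normalized Compression Distance (NCD) with zlib as the compressor, and the names `compressed_len`, `ncd`, `similarity`, and `gram_matrix` are hypothetical helpers, not the paper's definitions.

```python
import zlib
from typing import List, Sequence


def compressed_len(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, an assumed stand-in for the
    paper's measure: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def similarity(x: bytes, y: bytes) -> float:
    """Hypothetical similarity score: 1 - NCD, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - ncd(x, y)))


def gram_matrix(docs: Sequence[bytes]) -> List[List[float]]:
    """Pairwise similarity matrix. Needs on the order of n^2 compressions,
    which is the cost the abstract names as the method's main limit."""
    n = len(docs)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # Compressing x+y and y+x can give different lengths,
            # so average both orders to keep the matrix symmetric.
            s = 0.5 * (similarity(docs[i], docs[j]) + similarity(docs[j], docs[i]))
            gram[i][j] = gram[j][i] = s
    return gram
```

A matrix built this way could be handed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`. Note that a similarity of this kind is not guaranteed to be positive semidefinite, so treating it as a kernel is itself an approximation in this sketch.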

Taxonomy

Topics: Algorithms and Data Compression · Computability, Logic, AI Algorithms · Semigroups and Automata Theory