Text Classification with Compression Algorithms
Antonio Giuliano Zippo

TL;DR
This paper introduces a text classification method using compression-based kernels that are language-independent and can capture long-range dependencies, showing competitive accuracy on standard datasets.
Contribution
It proposes a novel kernel function based on compression algorithms for text similarity, offering an alternative to traditional feature-based methods.
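The summary does not give the exact kernel formula, so the following is only a minimal sketch of one common way to turn compressed lengths into a similarity score. It assumes the normalized compression distance (NCD) and uses zlib as the compressor. Both choices, and the names `compressed_len`, `ncd`, and `compression_similarity`, are illustrative assumptions rather than the paper's definition.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings.

    Assumed formula: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length. Lower means more similar.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_len(bx), compressed_len(by)
    cxy = compressed_len(bx + by)  # compress the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def compression_similarity(x: str, y: str) -> float:
    """Similarity derived from NCD (an assumption; higher means more similar)."""
    return 1.0 - ncd(x, y)


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog " * 20
    c = "0123456789 support vector machines and kernels " * 20
    print(compression_similarity(a, a), compression_similarity(a, c))
```

No tokenization, stemming, or feature extraction is applied to the strings, which is what makes this style of similarity independent of the text's language.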
Findings
In some cases, compression kernels achieve higher accuracy than Gaussian, linear, and polynomial kernels on the Web-KB, 20ng, and Reuters-21578 datasets.
The method is language independent and requires no text preprocessing.
Computational time is high, and performance on non-text datasets is poor.
Abstract
This work compares SVM kernel methods in text categorization tasks. In particular, I define a kernel function that estimates the similarity between two objects from their compressed lengths. Compression algorithms can detect arbitrarily long dependencies within text strings. By contrast, vectorizing text data loses information during feature extraction and is highly sensitive to the language of the text. Furthermore, compression-based methods are language independent and require no text preprocessing. Moreover, the accuracy computed on the datasets (Web-KB, 20ng and Reuters-21578) is, in some cases, greater than that of Gaussian, linear and polynomial kernels. The method's limitations are the computational time complexity of the Gram matrix and very poor performance on non-textual datasets.
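To illustrate why the Gram matrix is the computational bottleneck, here is a rough sketch that builds one for a small list of documents. It assumes an NCD-style similarity and zlib as the compressor, neither of which is confirmed by the abstract. The helper names `clen` and `gram_matrix` are hypothetical.

```python
import zlib


def clen(s: str) -> int:
    """Compressed length in bytes of a string, using zlib at level 9."""
    return len(zlib.compress(s.encode("utf-8"), 9))


def gram_matrix(docs):
    """Pairwise similarity matrix from compressed lengths.

    Assumption: similarity = 1 - NCD. The matrix is made symmetric by
    computing only the upper triangle and mirroring it. It is not
    guaranteed to be positive semidefinite.
    """
    n = len(docs)
    single = [clen(d) for d in docs]  # n compressions of single documents
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # One extra compression per pair: n * (n + 1) / 2 in total.
            joint = clen(docs[i] + docs[j])
            dist = (joint - min(single[i], single[j])) / max(single[i], single[j])
            K[i][j] = K[j][i] = 1.0 - dist
    return K


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog " * 20
    c = "0123456789 support vector machines and kernels " * 20
    for row in gram_matrix([a, a, c]):
        print(row)
```

Each pair of documents requires compressing their concatenation, so the number of compressor calls grows quadratically with the number of documents. A matrix like this could in principle be handed to an SVM implementation that accepts a precomputed kernel.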
Taxonomy
Topics: Algorithms and Data Compression · Computability, Logic, AI Algorithms · semigroups and automata theory
