Data Representation and Compression Using Linear-Programming Approximations
Hristo S. Paskov, John C. Mitchell, Trevor J. Hastie

TL;DR
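This paper introduces Dracula, an unsupervised feature selection framework for sequential data such as text. It solves a binary linear program, which may be relaxed to a linear program, to learn a dictionary of n-grams that compresses a corpus and recursively compresses the dictionary itself. The resulting structure can be used to derive features and to regularize learning.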
Contribution
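Dracula is a deep extension of Compressive Feature Learning: its dictionary of n-grams is itself compressed recursively, and the whole problem is posed as a binary linear program that may be relaxed to a linear program. Parameters of the program control the depth and diversity of the learned dictionary.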
Findings
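Dracula learns an n-gram dictionary that compresses a given corpus and, recursively, its own dictionary.
Experiments demonstrate the efficacy of features derived from Dracula's compressed documents.
The binary linear program and its linear-programming relaxation exhibit considerable structure, and their solution paths are well behaved.
Certain unregularized linear models are invariant to the structure of the compressed dictionary, but this structure may be used to regularize learning.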
Abstract
We propose 'Dracula', a new framework for unsupervised feature selection from sequential data such as text. Dracula learns a dictionary of n-grams that efficiently compresses a given corpus and recursively compresses its own dictionary; in effect, Dracula is a 'deep' extension of Compressive Feature Learning. It requires solving a binary linear program that may be relaxed to a linear program. Both problems exhibit considerable structure, their solution paths are well behaved, and we identify parameters which control the depth and diversity of the dictionary. We also discuss how to derive features from the compressed documents and show that while certain unregularized linear models are invariant to the structure of the compressed dictionary, this structure may be used to regularize learning. Experiments are presented that demonstrate the efficacy of Dracula's features.
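To make the idea of choosing an n-gram dictionary by solving a binary program concrete, here is a minimal sketch of a shallow, single-document variant. It is not the paper's formulation: the cost model (a fixed `pointer_cost` per pointer plus one unit per dictionary character), the function names, and the brute-force enumeration are all assumptions made for illustration. It does not compress the dictionary recursively, and it enumerates binary assignments directly instead of solving a linear-programming relaxation, so it is only feasible for very short strings.

```python
from itertools import product

def candidate_pointers(doc, max_n):
    """All (start, ngram) pairs with 1 <= len(ngram) <= max_n (hypothetical helper)."""
    return [(i, doc[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(doc) - n + 1)]

def cost(pointers, pointer_cost):
    """Assumed cost model: pointer_cost per pointer used, plus one unit per
    character of every distinct n-gram stored in the dictionary."""
    dictionary = {gram for _, gram in pointers}
    return pointer_cost * len(pointers) + sum(len(g) for g in dictionary)

def covers(pointers, doc_len):
    """True if every position of the document lies under at least one pointer."""
    covered = set()
    for start, gram in pointers:
        covered.update(range(start, start + len(gram)))
    return len(covered) == doc_len

def best_compression(doc, max_n, pointer_cost):
    """Enumerate every binary choice of pointers and keep the cheapest cover.
    This is exhaustive search over the binary program, not an LP relaxation."""
    cands = candidate_pointers(doc, max_n)
    best = None
    for mask in product((0, 1), repeat=len(cands)):
        chosen = [p for p, bit in zip(cands, mask) if bit]
        if covers(chosen, len(doc)):
            c = cost(chosen, pointer_cost)
            if best is None or c < best[0]:
                best = (c, chosen)
    return best

total, chosen = best_compression("abab", max_n=2, pointer_cost=1)
print(total, chosen)  # -> 4 [(0, 'ab'), (2, 'ab')]
```

In this toy setting the cheapest cover stores the single bigram "ab" and points to it twice (cost 4), which beats storing the two unigrams and using four pointers (cost 6). The selected n-grams are what a feature extractor would then count.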
Taxonomy
Topics: Algorithms and Data Compression · Machine Learning and Algorithms · Machine Learning in Bioinformatics
