Factorization-based Lossless Compression of Inverted Indices
George Beskales, Marcus Fontoura, Maxim Gurevich, Sergei Vassilvitskii, Vanja Josifovski

TL;DR
This paper introduces a lossless compression method for inverted indices using matrix factorization, achieving significant size reduction while maintaining query performance, and can be combined with other techniques for even better results.
Contribution
The paper formulates inverted index compression as a matrix factorization problem, develops a greedy algorithm for approximation, and proposes query modification methods to mitigate performance impacts.
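To make the idea of re-mapping terms concrete, here is a minimal toy sketch. It is not the paper's algorithm: it assumes a Boolean index (posting lists as sets of document IDs, no association weights), ignores per-term overhead, and uses hypothetical names (`greedy_factor`, `lookup`, `min_saving`). Whenever two terms share many documents, it moves the shared documents into a new combined term, so each shared document is stored once instead of twice, and it records which new terms a query on an original term must now read.

```python
from itertools import combinations


def count_postings(index):
    """Number of non-zero entries in the (Boolean) term-document matrix."""
    return sum(len(docs) for docs in index.values())


def greedy_factor(index, min_saving=2):
    """Toy greedy pass: repeatedly factor out the largest shared posting set.

    Returns (new_index, expansion), where expansion[t] lists the terms of the
    new term space whose posting lists must be unioned to answer a query on t.
    """
    index = {term: set(docs) for term, docs in index.items()}
    members = {term: {term} for term in index}    # new term -> original terms it covers
    expansion = {term: [term] for term in index}  # original term -> new terms to read

    while True:
        # Pick the pair of current terms with the largest overlap.
        best = max(
            combinations(sorted(index), 2),
            key=lambda pair: len(index[pair[0]] & index[pair[1]]),
            default=None,
        )
        if best is None:
            break
        a, b = best
        shared = index[a] & index[b]
        if len(shared) < min_saving:
            break  # no remaining pair saves enough postings

        # Store the shared documents once, under a new combined term.
        meta = f"({a}+{b})"
        index[meta] = shared
        index[a] = index[a] - shared
        index[b] = index[b] - shared
        members[meta] = members[a] | members[b]
        for original in members[meta]:
            expansion[original].append(meta)

    return index, expansion


def lookup(index, expansion, term):
    """Answer a single-term query against the factored index."""
    result = set()
    for new_term in expansion[term]:
        result |= index[new_term]
    return result


original = {"cheap": {1, 2, 3, 4}, "flights": {1, 2, 3, 5}, "hotel": {6}}
factored, expansion = greedy_factor(original)
print(count_postings(original), "->", count_postings(factored))
print(lookup(factored, expansion, "cheap") == original["cheap"])
```

In this toy run the index shrinks because the three documents shared by "cheap" and "flights" are stored only under the combined term, while a query for "cheap" has to read two posting lists instead of one. That extra reading is the kind of query-side cost the summary refers to.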
Findings
Achieves 20% index size reduction without affecting query times.
Up to 35% compression with slight query delay.
Combining with variable-byte encoding reduces size by 50%.
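For readers unfamiliar with variable-byte encoding, the following is a minimal sketch of one common variant, not necessarily the exact layout used in the paper's experiments. It assumes sorted document IDs stored as gaps, 7 payload bits per byte in low-order-first order, and a high bit that marks "more bytes follow"; the function names are illustrative.

```python
def to_gaps(doc_ids):
    """Turn sorted document IDs into differences between neighbours."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum of the gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers, 7 bits per byte, low-order bits first."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("variable-byte encoding needs non-negative integers")
        while n >= 128:
            out.append((n & 0x7F) | 0x80)  # high bit set: more bytes follow
            n >>= 7
        out.append(n)                      # high bit clear: last byte of this number
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            numbers.append(current)
            current, shift = 0, 0
    return numbers


doc_ids = [3, 130, 131, 70000]
encoded = vbyte_encode(to_gaps(doc_ids))
print(len(encoded), "bytes for", len(doc_ids), "postings")
print(from_gaps(vbyte_decode(encoded)) == doc_ids)
```

Small gaps fit in a single byte, so this scheme benefits from posting lists whose document IDs are close together. It operates on each posting list independently, which is why it can be layered on top of a method that changes which postings exist.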
Abstract
Many large-scale Web applications that require ranked top-k retrieval such as Web search and online advertising are implemented using inverted indices. An inverted index represents a sparse term-document matrix, where non-zero elements indicate the strength of term-document association. In this work, we present an approach for lossless compression of inverted indices. Our approach maps terms in a document corpus to a new term space in order to reduce the number of non-zero elements in the term-document matrix, resulting in a more compact inverted index. We formulate the problem of selecting a new term space that minimizes the resulting index size as a matrix factorization problem, and prove that finding the optimal factorization is an NP-hard problem. We develop a greedy algorithm for finding an approximate solution. A side effect of our approach is increasing the number of terms in the…
Taxonomy
Topics: Algorithms and Data Compression · Data Management and Algorithms · Advanced Database Systems and Queries
