TL;DR
This paper presents new techniques for compressing inverted indexes over highly repetitive document collections, significantly reducing space at the price of moderate query slowdowns, and introduces universal self-indexes that compress the data further at the cost of much slower queries.
Contribution
The paper introduces novel compression methods for inverted indexes that exploit near-copy regularities, and proposes universal self-indexes that are highly space-efficient on repetitive data.
Findings
The differential inverted lists of highly repetitive collections compress significantly under run-length, Lempel-Ziv, or grammar compression.
The new techniques use significantly less space than classical gap-encoding, at the price of moderate slowdowns.
Self-indexes achieve even greater compression but are much slower.
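To make the first finding concrete, here is a minimal sketch of why run-length compression helps on differential (gap-encoded) inverted lists. It is a toy illustration, not the paper's actual encoding: the function names and the posting list are hypothetical, and it assumes that successive versions of a document receive consecutive document identifiers, so a term that persists across versions yields long runs of gap 1.

```python
from itertools import groupby

def gaps(postings):
    """Differential form of a sorted posting list: first id, then differences."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def run_length(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def decode(runs):
    """Invert run_length applied to gaps, recovering the posting list."""
    postings, current = [], 0
    for value, count in runs:
        for _ in range(count):
            current += value
            postings.append(current)
    return postings

# Hypothetical posting list: a term present in versions 3..10 and 20..24.
postings = list(range(3, 11)) + list(range(20, 25))
diffs = gaps(postings)    # 13 gaps, mostly equal to 1
runs = run_length(diffs)  # 4 (value, count) pairs
```

In this example the 13 gaps collapse into 4 pairs, and `decode(runs)` returns the original list. Plain gap-encoding would still store one codeword per posting, which is the redundancy the run-length, Lempel-Ziv, and grammar-based variants aim to remove.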
Abstract
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know…
