Access Time Tradeoffs in Archive Compression
Matthias Petri, Alistair Moffat, P.C. Nagesh, Anthony Wirth

TL;DR
This paper evaluates the tradeoffs between access time and compression efficiency in archive compression, focusing on RLZ and block-based methods across HDD and SSD storage.
Contribution
It provides a detailed analysis of when RLZ compression outperforms alternatives and explores implementation trade-offs for different storage media.
Findings
RLZ excels in specific scenarios with large, repetitive data.
Compression rate significantly impacts access speed on HDDs.
SSD performance is less sensitive to compression parameters.
Abstract
Web archives, query and proxy logs, and so on, can all be very large and highly repetitive; and are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that "prime" the encoding for each block, and measure a range of implementation options using both hard-disk (HDD) and solid-state disk…
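The RLZ idea described in the abstract (a dictionary extracted from the text, then a greedy factorization against it) can be illustrated with a minimal sketch. This is not the authors' implementation. The evenly spaced sampling policy, the function names, the `(position, length)` factor layout, and the use of length 0 to mark a literal byte are all assumptions made for illustration. The final step of encoding the factors with static integer codes is omitted, and the naive substring search stands in for the suffix-based index a practical system would use.

```python
def build_dictionary(text, dict_size, sample_len):
    """Concatenate evenly spaced samples of `text`.

    Hypothetical sampling policy, assumed here for illustration only.
    """
    if dict_size >= len(text):
        return text
    n_samples = max(1, dict_size // sample_len)
    stride = len(text) // n_samples
    return b"".join(
        text[i * stride : i * stride + sample_len] for i in range(n_samples)
    )


def longest_match(dictionary, text, start):
    """Return (position, length) of the longest prefix of text[start:]
    that occurs somewhere in `dictionary`; length is 0 if none does."""
    best_pos, best_len = 0, 0
    length = 1
    # If a prefix of length k occurs, every shorter prefix occurs too,
    # so growing the candidate one byte at a time is sufficient.
    while start + length <= len(text):
        pos = dictionary.find(text[start : start + length])
        if pos < 0:
            break
        best_pos, best_len = pos, length
        length += 1
    return best_pos, best_len


def factorize(dictionary, text):
    """Greedy factorization: always take the longest available match."""
    factors = []
    i = 0
    while i < len(text):
        pos, length = longest_match(dictionary, text, i)
        if length == 0:
            # Assumed convention: (byte value, 0) marks a literal byte.
            factors.append((text[i], 0))
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def decode(dictionary, factors):
    """Rebuild the text by copying each factor out of the dictionary."""
    out = bytearray()
    for pos, length in factors:
        if length == 0:
            out.append(pos)  # literal byte stored in the position slot
        else:
            out += dictionary[pos : pos + length]
    return bytes(out)
```

Because every factor refers only to the fixed dictionary and never to previously decoded output, any contiguous run of factors can be decoded on its own once the dictionary is in memory, which is what makes access to small fragments possible without decompressing the whole archive.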
