Access Time Tradeoffs in Archive Compression

Matthias Petri; Alistair Moffat; P.C. Nagesh; Anthony Wirth

arXiv:1602.08829·cs.IT·March 1, 2016

TL;DR

This paper evaluates the tradeoffs between access time and compression efficiency in archive compression, focusing on RLZ and block-based methods across HDD and SSD storage.

Contribution

The paper provides a detailed analysis of when RLZ compression outperforms alternatives and explores implementation tradeoffs for different storage media.

Findings

01. RLZ excels in specific scenarios with large, repetitive data.

02. Compression rate significantly impacts access speed on HDDs.

03. SSD performance is less sensitive to compression parameters.

Abstract

Web archives, query and proxy logs, and so on, can all be very large and highly repetitive, and are accessed only sporadically and partially, rather than continually and holistically. This type of data is ideal for compression-based archiving, provided that random access to small fragments of the original data can be achieved without needing to decompress everything. The recent RLZ (relative Lempel-Ziv) compression approach uses a semi-static model extracted from the text to be compressed, together with a greedy factorization of the whole text encoded using static integer codes. Here we demonstrate more precisely than before the scenarios in which RLZ excels. We contrast RLZ with alternatives based on block-based adaptive methods, including approaches that "prime" the encoding for each block, and measure a range of implementation options using both hard-disk (HDD) and solid-state disk…
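To make the abstract's description of RLZ concrete, here is a minimal Python sketch of a greedy factorization against a dictionary sampled from the text. It is an illustration under stated assumptions, not the authors' implementation. The function names (`build_dictionary`, `factorize`, `decode`), the evenly spaced sampling policy, and the naive substring search are all hypothetical choices made for brevity. The factors are left as Python tuples, so the static integer coding step mentioned in the abstract is not shown.

```python
from typing import List, Tuple, Union

Factor = Union[Tuple[int, int], str]  # (dictionary offset, length) or a literal char


def build_dictionary(text: str, sample_len: int, num_samples: int) -> str:
    """Concatenate evenly spaced samples of the text.

    Assumption: uniform sampling is used here only for illustration.
    """
    if num_samples <= 0 or len(text) <= sample_len * num_samples:
        return text
    stride = len(text) // num_samples
    return "".join(
        text[i * stride : i * stride + sample_len] for i in range(num_samples)
    )


def factorize(text: str, dictionary: str) -> List[Factor]:
    """Greedy left-to-right parse of `text` relative to `dictionary`.

    Each factor is the longest prefix of the remaining text that occurs in
    the dictionary, or a single literal character when no match exists.
    """
    factors: List[Factor] = []
    pos = 0
    while pos < len(text):
        offset, length = 0, 0
        # Naive search: extend the match one character at a time.
        while pos + length < len(text):
            found = dictionary.find(text[pos : pos + length + 1])
            if found == -1:
                break
            offset, length = found, length + 1
        if length == 0:
            factors.append(text[pos])  # literal: character absent from dictionary
            pos += 1
        else:
            factors.append((offset, length))
            pos += length
    return factors


def decode(factors: List[Factor], dictionary: str) -> str:
    """Rebuild the text; each factor depends only on the dictionary."""
    parts = []
    for factor in factors:
        if isinstance(factor, str):
            parts.append(factor)
        else:
            offset, length = factor
            parts.append(dictionary[offset : offset + length])
    return "".join(parts)


if __name__ == "__main__":
    sample_text = "abracadabra" * 3 + "xyz"
    sample_dict = build_dictionary(sample_text, sample_len=11, num_samples=1)
    parsed = factorize(sample_text, sample_dict)
    assert decode(parsed, sample_dict) == sample_text
    print(len(parsed), "factors")
```

Because every factor refers only to the dictionary and never to earlier output, any factor can be decoded on its own once the dictionary is in memory. A practical implementation would replace the repeated `str.find` calls with an index over the dictionary, such as a suffix array.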
