Data Deduplication with Random Substitutions
Hao Lou, Farzad Farnoud

TL;DR
This paper analyzes deduplication algorithms on data streams whose repeats undergo probabilistic symbol substitutions. It proposes modifications to the fixed-length scheme that perform near-optimally and shows that variable-length deduplication achieves high compression ratios.
Contribution
It introduces an information-theoretic model for approximate deduplication with substitutions and proposes modifications to fixed-length schemes that perform near-optimally.
Findings
Fixed-length deduplication is unsuitable for the proposed substitution model because it does not take the edit probability into account.
Modified algorithms perform within a constant factor of optimal.
Variable-length deduplication achieves high compression ratios as entropy decreases.
Abstract
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time-efficient and are thus widely used in large-scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. More precisely, each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model as it does not take into account the edit probability. Two modifications are proposed and shown to have performances…
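To make the abstract's point about inexact repeats concrete, here is a minimal Python sketch. It is not the paper's source model or algorithm: the binary alphabet, the small pool of source strings, and the names `noisy_repeat_stream` and `fixed_length_dedup` are all assumptions chosen for illustration. It generates a stream of repeated blocks in which each symbol is flipped independently with a given edit probability, then runs a plain exact-match, fixed-length deduplicator over it.

```python
import random


def noisy_repeat_stream(num_blocks, block_len, num_sources, edit_prob, seed=0):
    """Toy source (an assumption, not the paper's model): every block is a copy
    of one of a few random binary source strings, and each bit of the copy is
    flipped independently with probability edit_prob."""
    rng = random.Random(seed)
    sources = [[rng.randint(0, 1) for _ in range(block_len)]
               for _ in range(num_sources)]
    stream = []
    for _ in range(num_blocks):
        src = rng.choice(sources)
        # XOR with a Bernoulli(edit_prob) indicator substitutes the bit.
        stream.extend(b ^ (rng.random() < edit_prob) for b in src)
    return stream


def fixed_length_dedup(stream, chunk_len):
    """Exact-match fixed-length deduplication: each distinct chunk is stored
    once, and the stream is represented as indices into that dictionary."""
    dictionary = {}
    pointers = []
    for start in range(0, len(stream), chunk_len):
        chunk = tuple(stream[start:start + chunk_len])
        if chunk not in dictionary:
            dictionary[chunk] = len(dictionary)
        pointers.append(dictionary[chunk])
    return dictionary, pointers


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.05):
        stream = noisy_repeat_stream(200, 64, 4, p, seed=1)
        dictionary, _ = fixed_length_dedup(stream, 64)
        print(f"edit_prob={p}: {len(dictionary)} distinct chunks out of 200")
```

With an edit probability of zero the dictionary holds at most one entry per source string. Once substitutions are introduced, most 64-bit chunks differ from every earlier chunk in at least one position, so exact matching stores them as new entries. This is the behavior that motivates treating the edit probability explicitly.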
