Data Deduplication with Random Substitutions

Hao Lou; Farzad Farnoud

arXiv:2107.00490·cs.IT·May 30, 2022

TL;DR

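This paper analyzes how data deduplication algorithms perform on data streams whose repeats contain probabilistic symbol substitutions. It proposes modifications to the fixed-length scheme that perform within a constant factor of optimal, and finds that variable-length deduplication achieves high compression ratios as entropy decreases.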

Contribution

It introduces an information-theoretic model for approximate deduplication with substitutions and proposes modifications to fixed-length schemes that perform near-optimally.

Findings

01 Fixed-length deduplication is unsuitable for substitution models.

02 Modified algorithms perform within a constant factor of optimal.

03 Variable-length deduplication achieves high compression ratios as entropy decreases.

Abstract

Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time-efficient and are thus widely used in large-scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model in which probabilistic substitutions are considered. More precisely, each symbol in a repeated string is substituted with a given edit probability. Deduplication algorithms in both the fixed-length scheme and the variable-length scheme are studied. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model as it does not take into account the edit probability. Two modifications are proposed and shown to have performances…
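To make the setting in the abstract concrete, here is a minimal Python sketch of a toy source with inexact repeats, together with a plain fixed-length deduplicator. It is an illustration under simplifying assumptions, not the paper's model or algorithms: the binary alphabet, the single base string, the chunk length equal to the base length, and the names `noisy_repeat_source`, `fixed_length_dedup`, and `reconstruct` are all hypothetical choices made for this example.

```python
import random


def noisy_repeat_source(base, n_blocks, delta, rng):
    """Emit n_blocks copies of `base` (a string of '0'/'1').

    Assumption for this sketch: each bit of each copy is flipped
    independently with edit probability `delta`.
    """
    blocks = []
    for _ in range(n_blocks):
        blocks.append("".join(
            ("1" if b == "0" else "0") if rng.random() < delta else b
            for b in base
        ))
    return "".join(blocks)


def fixed_length_dedup(stream, chunk_len):
    """Cut the stream into chunks of chunk_len symbols.

    Each distinct chunk is stored once in `dictionary`; the stream is
    then represented by a list of indices into that dictionary.
    """
    dictionary = {}  # chunk -> index, in order of first appearance
    pointers = []
    for i in range(0, len(stream), chunk_len):
        chunk = stream[i:i + chunk_len]
        if chunk not in dictionary:
            dictionary[chunk] = len(dictionary)
        pointers.append(dictionary[chunk])
    return dictionary, pointers


def reconstruct(dictionary, pointers):
    """Invert fixed_length_dedup (dicts preserve insertion order)."""
    chunks = list(dictionary)
    return "".join(chunks[p] for p in pointers)


if __name__ == "__main__":
    rng = random.Random(0)
    base = "".join(rng.choice("01") for _ in range(64))
    for delta in (0.0, 0.01, 0.05):
        stream = noisy_repeat_source(base, 200, delta, rng)
        dictionary, pointers = fixed_length_dedup(stream, len(base))
        assert reconstruct(dictionary, pointers) == stream
        print(f"delta={delta}: {len(dictionary)} distinct chunks out of {len(pointers)}")
```

With `delta = 0` every chunk is an exact repeat, so a single dictionary entry suffices. Once `delta > 0`, a chunk of length L in this toy source survives unedited only with probability (1 - delta)^L, so exact-match chunking stores many near-identical chunks. This is one way to see why a scheme that ignores the edit probability fits such a source poorly.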
