# An Information-Theoretic Analysis of Deduplication

**Authors:** Urs Niesen

arXiv: 1701.04451 · 2019-11-22

## TL;DR

This paper gives an information-theoretic analysis of data deduplication. It introduces a source model adapted to the deduplication setting, formalizes the standard fixed-length and variable-length schemes, and shows that a new multi-chunk scheme is order optimal under fairly mild assumptions.

## Contribution

The paper develops a formal information-theoretic analysis of deduplication, introduces a new multi-chunk scheme, and shows why boundary synchronization between source blocks and deduplication chunks matters for performance.

## Key findings

- Multi-chunk deduplication is order optimal under mild conditions.
- Boundary synchronization between source blocks and deduplication chunks is crucial for deduplication efficiency.
- The paper formalizes fixed-length and variable-length deduplication schemes.

## Abstract

Deduplication finds and removes long-range data duplicates. It is commonly used in cloud and enterprise server settings and has been successfully applied to primary, backup, and archival storage. Despite its practical importance as a source-coding technique, its analysis from the point of view of information theory is missing. This paper provides such an information-theoretic analysis of data deduplication. It introduces a new source model adapted to the deduplication setting. It formalizes the two standard fixed-length and variable-length deduplication schemes, and it introduces a novel multi-chunk deduplication scheme. It then provides an analysis of these three deduplication variants, emphasizing the importance of boundary synchronization between source blocks and deduplication chunks. In particular, under fairly mild assumptions, the proposed multi-chunk deduplication scheme is shown to be order optimal.
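
The boundary-synchronization issue mentioned in the abstract can be illustrated with a toy experiment. The following is a minimal sketch, not the paper's source model or its schemes: the record format, the 20-byte chunk size, the newline marker, and all function names are assumptions made purely for illustration. It compares fixed-length chunking with a simple marker-based variable-length chunking on a repeated block, with and without a one-byte shift between the two copies.

```python
def fixed_chunks(data, size):
    """Cut data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def marker_chunks(data, marker):
    """Cut data so that every chunk ends right after an occurrence of `marker`."""
    chunks, start = [], 0
    while True:
        hit = data.find(marker, start)
        if hit < 0:
            break
        end = hit + len(marker)
        chunks.append(data[start:end])
        start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes with no closing marker
    return chunks


def deduplicate(chunks):
    """Store each distinct chunk once; return (store, pointers into the store)."""
    index, store, pointers = {}, [], []
    for chunk in chunks:
        if chunk not in index:
            index[chunk] = len(store)
            store.append(chunk)
        pointers.append(index[chunk])
    return store, pointers


def reconstruct(store, pointers):
    """Invert deduplicate(): concatenate the chunks the pointers refer to."""
    return b"".join(store[p] for p in pointers)


# Hypothetical toy source: 200 records of exactly 20 bytes, each ending in "\n".
block = b"".join(b"record-%04d-payload\n" % i for i in range(200))

aligned = block + block           # second copy starts on a chunk boundary
shifted = block + b"X" + block    # one inserted byte breaks that alignment

fixed_aligned = deduplicate(fixed_chunks(aligned, 20))
fixed_shifted = deduplicate(fixed_chunks(shifted, 20))
marker_shifted = deduplicate(marker_chunks(shifted, b"\n"))

for name, (store, pointers) in [
    ("fixed, aligned", fixed_aligned),
    ("fixed, shifted", fixed_shifted),
    ("marker, shifted", marker_shifted),
]:
    print(f"{name}: {len(store)} stored chunks for {len(pointers)} pointers")
```

In this toy setup, fixed-length chunking stores 200 chunks for 400 pointers when the two copies are aligned, but 401 chunks for 401 pointers once a single byte is inserted, because every chunk of the second copy now straddles two records. The marker-based chunking resynchronizes at the next newline and stores 201 chunks for 400 pointers on the shifted input. The sketch says nothing about the multi-chunk scheme or the order-optimality result, which require the full paper.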

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.04451/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/1701.04451/full.md

## References

30 references — full list in the complete paper: https://tomesphere.com/paper/1701.04451/full.md

---
Source: https://tomesphere.com/paper/1701.04451