# Generalized Deduplication: Bounds, Convergence, and Asymptotic Properties

**Authors:** Rasmus Vestergaard, Qi Zhang, Daniel E. Lucani

arXiv: 1901.02720 · 2020-03-04

## TL;DR

This paper studies generalized deduplication, which losslessly deduplicates highly similar data. Under the proposed source model it reaches asymptotically near-entropy coding cost and converges multiple orders of magnitude faster than standard deduplication, so compression benefits appear much earlier in practical systems.

## Contribution

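The paper generalizes deduplication to handle highly similar data losslessly and shows that standard deduplication with fixed chunk length is a special case. It bounds the expected length of coded sequences, analyzes convergence relative to standard deduplication, and illustrates the gains with numerical examples.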

## Key findings

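- Under the proposed source model, generalized deduplication has asymptotically near-entropy coding cost.
- It converges multiple orders of magnitude faster than standard deduplication; even in a simple case, the gain in convergence speed is linear in the size of the data chunks.
- Numerical examples show that the lower bounds are achievable and illustrate the potential gain over standard deduplication.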

## Abstract

We study a generalization of deduplication, which enables lossless deduplication of highly similar data, and show that standard deduplication with fixed chunk length is a special case. We provide bounds on the expected length of coded sequences for generalized deduplication and show that the coding has asymptotic near-entropy cost under the proposed source model. More importantly, we show that generalized deduplication allows for multiple orders of magnitude faster convergence than standard deduplication. This means that generalized deduplication can provide compression benefits much earlier than standard deduplication, which is key in practical systems. Numerical examples demonstrate our results, showing that our lower bounds are achievable, and illustrating the potential gain of using the generalization over standard deduplication. In fact, we show that even for a simple case of generalized deduplication, the gain in convergence speed is linear with the size of the data chunks.
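The difference between the two schemes can be illustrated with a small Python sketch. This is not the authors' construction: the abstract does not say how chunks are transformed, so the bit-mask split of each chunk into a "base" and a "deviation", the function names, and the `mask` parameter are all assumptions made for illustration. The sketch only shows the general idea that similar chunks can share one dictionary entry while the small differences are stored alongside the pointer, and that a transform keeping every bit in the base reduces to standard fixed-length deduplication.

```python
from typing import Dict, List, Tuple


def standard_dedup(data: bytes, chunk_len: int) -> Tuple[List[bytes], List[int]]:
    """Fixed-length deduplication: store each distinct chunk once, plus pointers."""
    dictionary: Dict[bytes, int] = {}
    pointers: List[int] = []
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        # A new chunk gets the next free index; a repeated chunk reuses its index.
        pointers.append(dictionary.setdefault(chunk, len(dictionary)))
    return list(dictionary), pointers


def generalized_dedup(
    data: bytes, chunk_len: int, mask: int = 0xF0
) -> Tuple[List[bytes], List[Tuple[int, bytes]]]:
    """Toy generalization (assumed transform): deduplicate only the masked base.

    Each byte is split into base bits (b & mask) and deviation bits
    (the remaining bits). Only the bases go into the dictionary.
    """
    bases: Dict[bytes, int] = {}
    records: List[Tuple[int, bytes]] = []
    for i in range(0, len(data), chunk_len):
        chunk = data[i:i + chunk_len]
        base = bytes(b & mask for b in chunk)
        deviation = bytes(b & ~mask & 0xFF for b in chunk)
        records.append((bases.setdefault(base, len(bases)), deviation))
    return list(bases), records


def reconstruct(bases: List[bytes], records: List[Tuple[int, bytes]]) -> bytes:
    """Losslessly rebuild the data by recombining each base with its deviation."""
    return b"".join(
        bytes(x | y for x, y in zip(bases[idx], dev)) for idx, dev in records
    )
```

For example, with 2-byte chunks the input `bytes([0x10, 0x20, 0x11, 0x21, 0x12, 0x22, 0x10, 0x20])` contains three distinct chunks, so `standard_dedup` needs three dictionary entries. With `mask=0xF0`, all four chunks map to the same base, so `generalized_dedup` needs one entry and `reconstruct` still returns the original bytes. Setting `mask=0xFF` leaves every deviation at zero and reproduces the standard dictionary.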

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.02720/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1901.02720/full.md

## References

14 references — full list in the complete paper: https://tomesphere.com/paper/1901.02720/full.md

---
Source: https://tomesphere.com/paper/1901.02720