A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication
Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse

TL;DR
This paper provides a comprehensive theoretical and experimental comparison of leading Content-Defined Chunking algorithms for data deduplication, offering insights into their performance and limitations across multiple datasets.
Contribution
It offers the first thorough, impartial analysis and comparison of CDC algorithms, including new experimental results and contextual insights into their effectiveness.
Findings
Identifies strengths and weaknesses of various CDC algorithms.
Provides detailed performance metrics across multiple datasets.
Highlights previously unnoticed limitations of existing algorithms.
Abstract
Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone…
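To make the four metrics named in the abstract concrete, the following is a minimal sketch of a content-defined chunker and the measurements taken over its output. It is a toy gear-style rolling hash written for illustration, not an implementation of any specific algorithm evaluated in the paper. The names `chunk`, `metrics`, `GEAR`, `MASK`, `MIN_SIZE`, and `MAX_SIZE` and all parameter values are assumptions. The deduplication ratio is computed here as input bytes divided by unique stored bytes, which is one common definition and may differ from the one the paper uses.

```python
import hashlib
import random
import statistics
import time

# Hypothetical parameters, chosen only for illustration.
MASK = (1 << 10) - 1            # cut when the low 10 hash bits are zero
MIN_SIZE, MAX_SIZE = 256, 8192  # bounds on chunk size in bytes

# Fixed pseudo-random table with one 64-bit value per possible byte.
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]


def chunk(data):
    """Split data at positions determined by its content, not by offsets."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Rolling hash: older bytes are shifted out of the masked low bits.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be shorter than MIN_SIZE
    return chunks


def metrics(data):
    """Compute the four metrics for a non-empty input."""
    if not data:
        raise ValueError("metrics() expects non-empty input")
    t0 = time.perf_counter()
    chunks = chunk(data)
    elapsed = max(time.perf_counter() - t0, 1e-9)
    sizes = [len(c) for c in chunks]
    # Identical chunks share a fingerprint and are stored only once.
    unique = {hashlib.sha256(c).digest(): len(c) for c in chunks}
    return {
        "throughput_bytes_per_s": len(data) / elapsed,
        "dedup_ratio": len(data) / sum(unique.values()),  # assumed definition
        "avg_chunk_size": statistics.mean(sizes),
        "chunk_size_variance": statistics.pvariance(sizes),
    }
```

Because boundaries depend on the bytes themselves, inserting a single byte between two copies of the same block typically shifts only the chunks near the insertion; later boundaries realign and the repeated chunks deduplicate. A fixed-size chunker would misalign every chunk after the insertion.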
Taxonomy
Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Cryptography and Data Security
