Reducing Data Fragmentation in Data Deduplication Systems via Partial Repetition and Coding
Yun-Han Li, Jin Sima, Ilan Shomorony, Olgica Milenkovic

TL;DR
This paper develops a theoretical framework for data deduplication systems, focusing on reducing fragmentation and increasing robustness through graph models, limited duplication, and coding techniques.
Contribution
The paper introduces a graph-based model of file structures, new metrics for measuring fragmentation, and coding strategies that mitigate fragmentation and improve robustness.
Findings
New graph model for file structures as self-avoiding paths
Metrics for fragmentation including stretch and jump
Coding approaches to reduce fragmentation and improve robustness
Abstract
Data deduplication, one of the key features of modern Big Data storage devices, is the process of removing replicas of data chunks stored by different users. Despite the importance of deduplication, several drawbacks of the method, such as storage robustness and file fragmentation, have not been previously analyzed from a theoretical point of view. Storage robustness pertains to ensuring that deduplicated data can be used to reconstruct the original files without service disruptions and data loss. Fragmentation pertains to the problems of placing deduplicated data chunks of different user files in a proximity-preserving linear order, since neighboring chunks of the same file may be stored in sectors far apart on the server. This work proposes a new theoretical model for data fragmentation and introduces novel graph- and coding-theoretic approaches for reducing fragmentation via limited…
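To make the fragmentation problem in the abstract concrete, here is a minimal toy sketch in Python. It is an illustration only, not the paper's model: it does not implement the stretch or jump metrics, whose definitions are not given in this summary. The function names `deduplicate` and `max_gap`, the first-seen placement rule, and the sample files are all assumptions chosen for the example.

```python
def deduplicate(files):
    """Store each distinct chunk once, in first-seen order (an assumed placement rule)."""
    position = {}  # chunk -> index in the linear store
    store = []
    for chunks in files.values():
        for chunk in chunks:
            if chunk not in position:
                position[chunk] = len(store)
                store.append(chunk)
    return store, position


def max_gap(chunks, position):
    """Largest distance in the store between consecutive chunks of one file.

    This is a hypothetical stand-in for a fragmentation measure,
    not a metric taken from the paper.
    """
    return max(
        (abs(position[a] - position[b]) for a, b in zip(chunks, chunks[1:])),
        default=0,
    )


# Two hypothetical user files that share chunks "A" and "D".
files = {
    "user1": ["A", "B", "C", "D"],
    "user2": ["A", "E", "D", "F"],
}

store, position = deduplicate(files)
gaps = {name: max_gap(chunks, position) for name, chunks in files.items()}

print(store)  # ['A', 'B', 'C', 'D', 'E', 'F']
print(gaps)   # {'user1': 1, 'user2': 4}
```

In this toy layout the first file stays contiguous, while the second file's neighboring chunks "A" and "E" end up four positions apart because "A" was already stored for the first file. Storing a second copy of a shared chunk near the file that needs it would shrink that gap at the cost of extra storage, which is the kind of trade-off the abstract describes as limited duplication.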
Taxonomy
Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Privacy-Preserving Technologies in Data
