Reducing Data Fragmentation in Data Deduplication Systems via Partial Repetition and Coding

Yun-Han Li; Jin Sima; Ilan Shomorony; Olgica Milenkovic

arXiv:2411.01407 · cs.IT · November 5, 2024

TL;DR

This paper develops a theoretical framework for data deduplication systems, focusing on reducing fragmentation and increasing robustness through graph models, limited duplication, and coding techniques.

Contribution

It introduces a novel graph-based model for file structures and new metrics for measuring fragmentation, along with coding strategies to mitigate fragmentation and enhance robustness.

Findings

01. New graph model for file structures as self-avoiding paths

02. Metrics for fragmentation, including stretch and jump

03. Coding approaches to reduce fragmentation and improve robustness

Abstract

Data deduplication, one of the key features of modern Big Data storage devices, is the process of removing replicas of data chunks stored by different users. Despite the importance of deduplication, several drawbacks of the method, such as storage robustness and file fragmentation, have not been previously analyzed from a theoretical point of view. Storage robustness pertains to ensuring that deduplicated data can be used to reconstruct the original files without service disruptions and data loss. Fragmentation pertains to the problems of placing deduplicated data chunks of different user files in a proximity-preserving linear order, since neighboring chunks of the same file may be stored in sectors far apart on the server. This work proposes a new theoretical model for data fragmentation and introduces novel graph- and coding-theoretic approaches for reducing fragmentation via limited…
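The fragmentation problem described in the abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the helper names `deduplicate` and `max_gap`, the two example files, and the "largest gap between consecutive chunks" measure are hypothetical and are not the paper's model or its stretch and jump metrics. The sketch stores each distinct chunk once in a linear order and then measures how far apart neighboring chunks of a single file end up.

```python
from typing import Dict, List


def deduplicate(files: Dict[str, List[str]]) -> List[str]:
    """Store each distinct chunk once, in order of first appearance."""
    store: List[str] = []
    seen = set()
    for chunks in files.values():
        for chunk in chunks:
            if chunk not in seen:
                seen.add(chunk)
                store.append(chunk)
    return store


def max_gap(file_chunks: List[str], store: List[str]) -> int:
    """Largest distance in the store between consecutive chunks of one file.

    This is an illustrative measure only, not the paper's definition.
    """
    position = {chunk: i for i, chunk in enumerate(store)}
    return max(
        (abs(position[a] - position[b])
         for a, b in zip(file_chunks, file_chunks[1:])),
        default=0,
    )


# Two hypothetical user files that share chunks "A" and "D".
files = {
    "user1": ["A", "B", "C", "D"],
    "user2": ["A", "E", "D", "F"],
}

store = deduplicate(files)
gaps = {name: max_gap(chunks, store) for name, chunks in files.items()}
print(store)
print(gaps)
```

In this toy layout the first file stays contiguous, while the second file's neighboring chunks are spread across the store because its shared chunks were placed according to the first file's order.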

Taxonomy

Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Privacy-Preserving Technologies in Data