Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression
Xuming Ye, Xiaoye Xue, Wenlong Tian, Zhiyong Xu, Weijun Xiao, Ruixuan Li

TL;DR
This paper introduces CARD, a chunk-context aware resemblance detection algorithm that combines internal chunk structure and context information, significantly improving redundancy detection and detection speed in cloud storage deduplication.
Contribution
The paper proposes a novel neural network-based method that enhances resemblance detection by integrating chunk content and context, outperforming existing methods in accuracy and efficiency.
Findings
Detects up to 75.03% more redundant data
Accelerates resemblance detection by 5.6 to 17.8 times
Effectively reduces impact of small data changes
Abstract
With the growing popularity of cloud storage, removing duplicated data across users is becoming more critical for service providers to reduce costs. Data resemblance detection is a recent technique for detecting redundancy among similar chunks. It extracts features from each chunk's content and treats chunks with high similarity as candidates for redundancy removal. However, popular resemblance detection methods such as "N-transform" and "Finesse" use only the chunk data for feature extraction, so a minor modification to a data chunk can seriously degrade their resemblance detection capability. In this paper, we propose a novel chunk-context aware resemblance detection algorithm, called CARD, to mitigate this issue. CARD introduces a BP-Neural network-based chunk-context aware model and uses an N-sub-chunk shingles-based initial feature extraction strategy. It effectively integrates each data…
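To make the idea of sub-chunk shingle features concrete, here is a minimal Python sketch. It is not the CARD implementation: the abstract does not give the algorithm's details, so the function names (sub_chunk_shingle_features, jaccard), the fixed-size split, the parameters n_sub and shingle, and the use of BLAKE2b hashes with Jaccard similarity are all assumptions chosen for illustration. The BP-Neural network context model is not reproduced here.

```python
import hashlib


def sub_chunk_shingle_features(chunk: bytes, n_sub: int = 8, shingle: int = 2) -> list[int]:
    """Hypothetical feature extractor: split a chunk into about n_sub
    fixed-size sub-chunks and hash every window of `shingle` consecutive
    sub-chunks. This is an illustrative assumption, not the paper's method."""
    if not chunk:
        return []
    size = -(-len(chunk) // n_sub)  # ceiling division: bytes per sub-chunk
    subs = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    features = []
    for i in range(len(subs) - shingle + 1):
        window = b"".join(subs[i:i + shingle])
        digest = hashlib.blake2b(window, digest_size=8).digest()
        features.append(int.from_bytes(digest, "big"))
    return features


def jaccard(a: list[int], b: list[int]) -> float:
    """Jaccard similarity of two feature lists, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Build a deterministic 1024-byte chunk with no repeated sub-chunks.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))
# Flip a single byte in place (same length, one sub-chunk affected).
edited = bytes([base[0] ^ 0xFF]) + base[1:]

base_features = sub_chunk_shingle_features(base)
edited_features = sub_chunk_shingle_features(edited)
similarity = jaccard(base_features, edited_features)
print(f"features per chunk: {len(base_features)}, similarity: {similarity:.2f}")
```

In this sketch a one-byte in-place edit changes only the shingles that cover the modified sub-chunk, so the two chunks still share most features. Because the split is fixed-size, an insertion or deletion would shift every later sub-chunk boundary and change far more features, which is the kind of sensitivity to small changes that content-only features suffer from.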
Taxonomy
Topics: Cloud Data Security Solutions · Data Quality and Management · Advanced Data Storage Technologies
