Vectorized Sequence-Based Chunking for Data Deduplication
Sreeharsha Udayashankar, Samer Al-Kiswany

TL;DR
SeqCDC is a fast, vectorized data chunking algorithm that significantly improves chunking throughput for data deduplication with minimal impact on space savings.
Contribution
Introduces SeqCDC, a novel chunking algorithm that combines lightweight boundary detection, content-defined skipping, and SSE/AVX acceleration to improve chunking throughput for large chunk sizes.
Findings
15x higher throughput than unaccelerated chunking algorithms
1.2x-1.35x higher throughput than other vector-accelerated algorithms
Minimal impact on deduplication space savings
Abstract
Data deduplication has gained wide acclaim as a mechanism to improve storage efficiency and conserve network bandwidth. Its most critical phase, data chunking, is responsible for the overall space savings achieved via the deduplication process. However, modern data chunking algorithms are slow and compute-intensive because they scan large amounts of data while simultaneously making data-driven boundary decisions. We present SeqCDC, a novel chunking algorithm that leverages lightweight boundary detection, content-defined skipping, and SSE/AVX acceleration to improve chunking throughput for large chunk sizes. Our evaluation shows that SeqCDC achieves 15x higher throughput than unaccelerated and 1.2x-1.35x higher throughput than vector-accelerated data chunking algorithms while minimally affecting deduplication space savings.
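The abstract names the ingredients of SeqCDC but not their mechanics. The sketch below is one plausible reading of "sequence-based" boundary detection with content-defined skipping, not the authors' implementation. It assumes a chunk boundary is declared once a run of strictly increasing bytes is seen, and that a scan region is skipped after enough decreasing byte pairs. The function name `chunk_boundaries` and every parameter (`min_size`, `max_size`, `seq_length`, `skip_trigger`, `skip_size`) and default value are hypothetical. The loop is scalar and omits the SSE/AVX acceleration entirely.

```python
from typing import List


def chunk_boundaries(data: bytes, min_size: int = 64, max_size: int = 1024,
                     seq_length: int = 3, skip_trigger: int = 8,
                     skip_size: int = 16) -> List[int]:
    """Return the exclusive end offset of every chunk in `data`.

    Illustrative assumptions (not taken from the paper's text above):
    a boundary is cut after `seq_length` strictly increasing bytes, and
    the scan jumps ahead `skip_size` bytes after `skip_trigger`
    decreasing byte pairs.
    """
    if min_size < 1 or max_size < min_size or seq_length < 2:
        raise ValueError("need 1 <= min_size <= max_size and seq_length >= 2")

    boundaries: List[int] = []
    start, n = 0, len(data)
    while start < n:
        end_limit = min(start + max_size, n)  # forced cut at max chunk size
        cut = end_limit
        i = start + min_size                  # never cut inside the minimum region
        run = 0                               # consecutive increasing pairs seen
        opposing = 0                          # decreasing pairs seen since last skip
        while i < end_limit:
            if data[i] > data[i - 1]:
                run += 1
                if run == seq_length - 1:     # seq_length increasing bytes found
                    cut = i + 1
                    break
            else:
                run = 0
                if data[i] < data[i - 1]:
                    opposing += 1
                    if opposing >= skip_trigger:
                        # Content-defined skip: this region looks unlikely
                        # to contain a boundary, so jump ahead.
                        opposing = 0
                        i += skip_size
                        continue
            i += 1
        boundaries.append(cut)
        start = cut
    return boundaries
```

Because each decision depends only on nearby byte values, the cut points follow the content rather than fixed offsets, which is the property deduplication relies on. In a full pipeline each resulting chunk would then be fingerprinted (for example with a cryptographic hash) to detect duplicates.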
Taxonomy
Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Data Quality and Management
