Vectorized Sequence-Based Chunking for Data Deduplication

Sreeharsha Udayashankar; Samer Al-Kiswany

arXiv:2505.21194·cs.DC·May 28, 2025

TL;DR

SeqCDC is a fast, vectorized data chunking algorithm that significantly improves throughput in data deduplication processes while maintaining effective space savings.

Contribution

Introduces SeqCDC, a novel vectorized chunking algorithm utilizing lightweight boundary detection and hardware acceleration for enhanced performance.

Findings

01. 15x higher throughput than unaccelerated methods

02. 1.2x-1.35x higher throughput than other vector-accelerated algorithms

03. Minimal impact on deduplication space savings

Abstract

Data deduplication has gained wide acclaim as a mechanism to improve storage efficiency and conserve network bandwidth. Its most critical phase, data chunking, is responsible for the overall space savings achieved via the deduplication process. However, modern data chunking algorithms are slow and compute-intensive because they scan large amounts of data while simultaneously making data-driven boundary decisions. We present SeqCDC, a novel chunking algorithm that leverages lightweight boundary detection, content-defined skipping, and SSE/AVX acceleration to improve chunking throughput for large chunk sizes. Our evaluation shows that SeqCDC achieves 15x higher throughput than unaccelerated and 1.2x-1.35x higher throughput than vector-accelerated data chunking algorithms while minimally affecting deduplication space savings.
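
To make the terms in the abstract concrete, the following is a minimal scalar sketch of content-defined chunking with a lightweight boundary test and content-defined skipping, followed by a space-savings calculation. The abstract does not spell out SeqCDC's boundary rule, so everything specific here is an assumption made for illustration: a boundary is declared after `seq_len` strictly increasing bytes, and the scanner jumps ahead `skip_size` bytes once it has seen `skip_trigger` non-increasing byte pairs. The function names, parameters, and default values are hypothetical and are not taken from the paper. The SSE/AVX acceleration is not shown; a vectorized version would compare many adjacent byte pairs per instruction instead of one at a time.

```python
import hashlib
import random


def seq_chunks(data, seq_len=5, min_size=64, max_size=1024,
               skip_trigger=8, skip_size=32):
    """Split data into chunks using an assumed sequence-based boundary rule.

    All parameters are illustrative assumptions, not values from the paper.
    """
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end_limit = min(start + max_size, n)  # forced cut at the maximum size
        i = start + min_size                  # bytes below the minimum size are not scanned
        run = 1                               # length of the current increasing run
        opposing = 0                          # non-increasing pairs seen since the last skip
        cut = end_limit
        while i < end_limit:
            if data[i] > data[i - 1]:
                run += 1
                if run >= seq_len:            # lightweight boundary test: no hashing
                    cut = i + 1
                    break
            else:
                run = 1
                opposing += 1
                if opposing >= skip_trigger:  # region looks unfavorable: skip ahead
                    i += skip_size
                    opposing = 0
                    continue
            i += 1
        chunks.append(data[start:cut])
        start = cut
    return chunks


def space_savings(chunks):
    """Fraction of bytes saved by storing each distinct chunk only once."""
    total = sum(len(c) for c in chunks)
    if total == 0:
        return 0.0
    unique = {}
    for c in chunks:
        # The SHA-256 fingerprint identifies duplicate chunks.
        unique[hashlib.sha256(c).digest()] = len(c)
    return 1 - sum(unique.values()) / total


if __name__ == "__main__":
    rng = random.Random(0)
    block = bytes(rng.randrange(256) for _ in range(1 << 14))
    data = block + block  # repeated content gives deduplication something to find
    chunks = seq_chunks(data)
    print(len(chunks), "chunks, space savings:", round(space_savings(chunks), 3))
```

The boundary test uses only byte comparisons, which is the sense in which it is lightweight: no rolling hash is computed per byte. Skipping trades some boundary precision for throughput, which is consistent with the abstract's framing of higher speed at a small cost in space savings.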

Taxonomy

Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Data Quality and Management