Accelerating Data Chunking in Deduplication Systems using Vector Instructions

Sreeharsha Udayashankar; Abdelrahman Baba; Samer Al-Kiswany

arXiv:2508.05797 · cs.DC · January 28, 2026

Open Access

TL;DR

VectorCDC significantly accelerates content-defined chunking in deduplication systems by leveraging vector CPU instructions, achieving up to 26x throughput improvements without compromising space savings.

Contribution

The paper introduces VectorCDC, a novel vector instruction-based approach to speed up hashless CDC algorithms in data deduplication.

Findings

01 Achieves 8.35x - 26.2x higher throughput than existing vector-accelerated algorithms.

02 Achieves 15.3x - 207.2x higher throughput than existing unaccelerated algorithms.

03 Preserves the deduplication space savings of the underlying algorithms.

Abstract

Content-defined Chunking (CDC) algorithms dictate the overall space savings that deduplication systems achieve. However, due to their need to scan each file in its entirety, they are slow and often the main performance bottleneck within data deduplication. We present VectorCDC, a method to accelerate hashless CDC algorithms using vector CPU instructions, such as SSE / AVX. We analyzed the state-of-the-art chunking algorithms and discovered that hashless algorithms primarily use two data processing patterns to identify chunk boundaries: Extreme Byte Searches and Range Scans. VectorCDC presents a vector-friendly approach to accelerate these two patterns. Using VectorCDC, we accelerated three state-of-the-art hashless chunking algorithms: RAM, AE, and MAXP. Our evaluation shows that VectorCDC is effective on Intel, AMD, ARM, and IBM CPUs, achieving 8.35x - 26.2x higher throughput than…
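The two patterns named in the abstract can be illustrated with a minimal C sketch. It is not the authors' implementation: the function names (`max_byte`, `first_ge`, `ram_style_cut`), the fixed `window` parameter, and the simplified RAM-style boundary rule (take the maximum byte of a leading window, then cut at the first later byte that is greater than or equal to it) are assumptions made for illustration. It uses SSE2 intrinsics when the compiler defines `__SSE2__`, assumes GCC or Clang for `__builtin_ctz`, and falls back to scalar loops otherwise.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Extreme Byte Search: largest byte value in buf[0..n). */
static uint8_t max_byte(const uint8_t *buf, size_t n) {
    uint8_t best = 0;
    size_t i = 0;
#if defined(__SSE2__)
    if (n >= 16) {
        __m128i acc = _mm_setzero_si128();
        for (; i + 16 <= n; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
            acc = _mm_max_epu8(acc, v);        /* lane-wise unsigned max */
        }
        uint8_t lanes[16];
        _mm_storeu_si128((__m128i *)lanes, acc);
        for (int k = 0; k < 16; k++)           /* reduce the 16 lanes */
            if (lanes[k] > best) best = lanes[k];
    }
#endif
    for (; i < n; i++)                         /* scalar tail (or fallback) */
        if (buf[i] > best) best = buf[i];
    return best;
}

/* Range Scan: index of the first byte >= target in buf[start..n), or n. */
static size_t first_ge(const uint8_t *buf, size_t start, size_t n,
                       uint8_t target) {
    size_t i = start;
#if defined(__SSE2__)
    const __m128i t = _mm_set1_epi8((char)target);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(buf + i));
        /* SSE2 has no unsigned >= compare; max(v, t) == v means v >= t. */
        __m128i ge = _mm_cmpeq_epi8(_mm_max_epu8(v, t), v);
        int mask = _mm_movemask_epi8(ge);      /* one bit per byte lane */
        if (mask != 0)
            return i + (size_t)__builtin_ctz((unsigned)mask);
    }
#endif
    for (; i < n; i++)                         /* scalar tail (or fallback) */
        if (buf[i] >= target) return i;
    return n;
}

/* Simplified RAM-style cut point: returns the length of the next chunk.
 * The boundary byte is included in the chunk. */
size_t ram_style_cut(const uint8_t *buf, size_t n, size_t window) {
    if (n <= window) return n;
    uint8_t m = max_byte(buf, window);
    size_t pos = first_ge(buf, window, n, m);
    return pos < n ? pos + 1 : n;
}
```

Calling `ram_style_cut` repeatedly on the remaining bytes of a file yields successive chunk lengths. Both helpers examine 16 bytes per loop iteration on SSE2 hardware, which is the kind of data-parallel scan the abstract describes; wider AVX registers would follow the same structure.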

Taxonomy

Topics: Cloud Data Security Solutions · Advanced Data Storage Technologies · Distributed systems and fault tolerance