Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Sietse Schelpe

arXiv:2605.09611·cs.CL·May 12, 2026

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Sietse Schelpe

PDF

TL;DR

This paper empirically analyzes byte-exact deduplication in Retrieval-Augmented Generation, demonstrating significant context reduction and compute savings across various regimes without quality loss, validated through human evaluations on multiple APIs.

Contribution

It provides the first large-scale empirical validation that byte-exact deduplication can reduce context size and inference costs in RAG pipelines without affecting quality.

Findings

01

Context reduction up to 80.34% in conversational AI regimes.

02

No measurable quality regression observed after deduplication.

03

All tested vendors passed the strict quality preservation thresholds.

Abstract

This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.