Web Retrieval-Aware Chunking (W-RAC) for Efficient and Cost-Effective Retrieval-Augmented Generation Systems
Uday Allu, Sonu Kedia, Tanmay Odapally, Biddwan Ahmed

TL;DR
This paper introduces W-RAC, a web-specific chunking framework for retrieval-augmented generation systems. It reduces cost and improves scalability by decoupling text extraction from semantic chunk planning, using LLMs only to decide how extracted content is grouped rather than to generate chunk text.
Contribution
W-RAC is a novel, cost-efficient web document chunking method that separates content extraction from semantic grouping, enhancing scalability and reducing token costs.
Findings
W-RAC reduces chunking-related LLM costs by an order of magnitude.
W-RAC achieves comparable or better retrieval performance than traditional methods.
W-RAC improves system observability and eliminates hallucination risks.
Abstract
Retrieval-Augmented Generation (RAG) systems critically depend on effective document chunking strategies to balance retrieval quality, latency, and operational cost. Traditional chunking approaches, such as fixed-size, rule-based, or fully agentic chunking, often suffer from high token consumption, redundant text generation, limited scalability, and poor debuggability, especially for large-scale web content ingestion. In this paper, we propose Web Retrieval-Aware Chunking (W-RAC), a novel, cost-efficient chunking framework designed specifically for web-based documents. W-RAC decouples text extraction from semantic chunk planning by representing parsed web content as structured, ID-addressable units and leveraging large language models (LLMs) only for retrieval-aware grouping decisions rather than text generation. This significantly reduces token usage, eliminates hallucination risks,…
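The decoupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Unit` type, the function names, the sample page content, and the hard-coded `plan` (standing in for a planner model's output) are all assumptions made for illustration, as is the rule that a plan must cover every unit exactly once. The sketch shows the core idea only: the planner sees ID-tagged units and returns groups of IDs, and chunk text is then assembled by copying the original unit text verbatim.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """One parsed piece of web content, addressable by a stable ID (hypothetical type)."""
    uid: str
    text: str


def build_prompt_view(units):
    """Render an ID-tagged listing that a planner model could read (assumed format)."""
    return "\n".join(f"[{u.uid}] {u.text}" for u in units)


def validate_plan(plan, units):
    """Reject plans with unknown IDs, reused units, or omitted units.

    Full, non-overlapping coverage is an assumption of this sketch,
    not a requirement stated in the abstract.
    """
    known = {u.uid for u in units}
    seen = [uid for group in plan for uid in group]
    unknown = [uid for uid in seen if uid not in known]
    if unknown:
        raise ValueError(f"plan references unknown IDs: {unknown}")
    if len(seen) != len(set(seen)):
        raise ValueError("plan assigns a unit to more than one chunk")
    missing = known - set(seen)
    if missing:
        raise ValueError(f"plan omits units: {sorted(missing)}")


def assemble_chunks(plan, units):
    """Build each chunk by copying original unit text; nothing is generated."""
    validate_plan(plan, units)
    by_id = {u.uid: u.text for u in units}
    return ["\n".join(by_id[uid] for uid in group) for group in plan]


# Made-up sample content for illustration only.
units = [
    Unit("u1", "Pricing"),
    Unit("u2", "The basic plan costs $10 per month."),
    Unit("u3", "Support"),
    Unit("u4", "Email support is available on weekdays."),
]

# Stand-in for the planner's output: groups of IDs only, no text.
plan = [["u1", "u2"], ["u3", "u4"]]
chunks = assemble_chunks(plan, units)
```

Because the plan carries only IDs, it can be logged and checked before any chunk is built, and the assembled text cannot differ from what was extracted. The planner's output is also much shorter than a regenerated copy of the page, which is where a saving in output tokens would come from under this design.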
