Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation
Pooja Guttal, Varun Magotra, Vasudeva Mahavishnu, Natasha Chanto, Sidharth Sivaprasad, Manas Gaur

TL;DR
This paper introduces a structure-aware chunking method for tabular data that enhances retrieval-augmented generation by preserving data structure, leading to better retrieval performance and efficiency.
Contribution
The authors propose a novel hierarchical row-based chunking framework that maintains tabular structure, improving token utilization and retrieval effectiveness over existing methods.
Findings
Reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines, respectively, on the MAUD dataset.
Improves retrieval MRR from 0.3576 to 0.5945.
Increases Recall@1 from 0.366 to 0.754.
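For readers unfamiliar with the two retrieval metrics reported above, the following is a minimal sketch of how MRR and Recall@1 are commonly computed. It assumes each query has exactly one relevant chunk and that its 1-based rank in the retrieved list is known (None when it was not retrieved). The function names and the example ranks are illustrative only and are not taken from the paper.

```python
from typing import List, Optional


def mean_reciprocal_rank(ranks: List[Optional[int]]) -> float:
    """Average of 1/rank over all queries; a missed query contributes 0."""
    total = sum(1.0 / r for r in ranks if r is not None)
    return total / len(ranks)


def recall_at_k(ranks: List[Optional[int]], k: int) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results.

    Assumption: one relevant chunk per query, so this equals the hit rate.
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for four queries: found at rank 1, rank 2, missed, rank 4.
example_ranks = [1, 2, None, 4]
mrr = mean_reciprocal_rank(example_ranks)      # (1 + 0.5 + 0 + 0.25) / 4
recall_1 = recall_at_k(example_ranks, k=1)     # 1 hit out of 4 queries
```

Under this definition, a Recall@1 of 0.754 would mean the relevant chunk is the top-ranked result for roughly three out of four queries.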
Abstract
Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. Across evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines,…
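The pipeline described in the abstract (rows encoded as key-value blocks, token-constrained splitting at structural boundaries, then overlap-free greedy merging) can be sketched roughly as follows. This is not the authors' implementation: the Row Tree is reduced here to a flat list of fields per row, whitespace token counting stands in for a real tokenizer, and all function names are hypothetical.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def row_to_fields(row: Dict[str, str]) -> List[str]:
    # Encode each cell as a "column: value" line so the field keeps its header.
    return [f"{key}: {value}" for key, value in row.items()]


def split_row(row: Dict[str, str], max_tokens: int) -> List[str]:
    """Split one row into pieces, breaking only at field boundaries.

    A single field larger than the budget is kept whole as an oversized piece.
    """
    pieces, current, used = [], [], 0
    for field in row_to_fields(row):
        n = count_tokens(field)
        if current and used + n > max_tokens:
            pieces.append("\n".join(current))
            current, used = [], 0
        current.append(field)
        used += n
    if current:
        pieces.append("\n".join(current))
    return pieces


def chunk_rows(rows: List[Dict[str, str]], max_tokens: int) -> List[str]:
    """Greedily pack row pieces into chunks without repeating any piece."""
    chunks, current, used = [], [], 0
    for row in rows:
        for piece in split_row(row, max_tokens):
            n = count_tokens(piece)
            if current and used + n > max_tokens:
                chunks.append("\n\n".join(current))
                current, used = [], 0
            current.append(piece)
            used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Hypothetical table with three short rows (4 whitespace tokens each).
rows = [
    {"id": "1", "name": "Ada"},
    {"id": "2", "name": "Bob"},
    {"id": "3", "name": "Cy"},
]
chunks = chunk_rows(rows, max_tokens=8)
```

With a budget of 8 tokens, the first two rows are packed into one chunk and the third starts a new one. No field is repeated across chunks, which is the sense of "overlap-free" assumed in this sketch.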
