Structure-Aware Chunking for Tabular Data in Retrieval-Augmented Generation

Pooja Guttal; Varun Magotra; Vasudeva Mahavishnu; Natasha Chanto; Sidharth Sivaprasad; Manas Gaur

arXiv:2605.00318 · cs.CL · May 4, 2026

TL;DR

This paper introduces a structure-aware chunking method for tabular data in retrieval-augmented generation. By keeping the fields of each row together, it produces fewer chunks and improves retrieval accuracy.

Contribution

The authors propose a novel hierarchical row-based chunking framework that maintains tabular structure, improving token utilization and retrieval effectiveness over existing methods.

Findings

01

Reduces chunk count on the MAUD dataset by up to 40% relative to a standard recursive baseline and up to 56% relative to a key-value based baseline.

02

Improves retrieval MRR from 0.3576 to 0.5945.

03

Increases Recall@1 from 0.366 to 0.754.
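
For readers unfamiliar with the two retrieval metrics cited above, the following is a minimal sketch of how MRR and Recall@k are commonly computed. It assumes exactly one relevant chunk per query, which may not match the paper's evaluation setup; the function names and the example rankings are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant chunk, or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: List[Sequence[str]], relevant: List[str]) -> float:
    """Average the reciprocal rank over all queries."""
    scores = [reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant)]
    return sum(scores) / len(scores)


def recall_at_k(rankings: List[Sequence[str]], relevant: List[str], k: int = 1) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results.

    Assumption: one relevant chunk per query, so per-query recall is 0 or 1.
    """
    hits = [1.0 if rel in r[:k] else 0.0 for r, rel in zip(rankings, relevant)]
    return sum(hits) / len(hits)


# Hypothetical rankings for three queries (illustrative data only).
example_rankings = [["a", "b"], ["b", "a"], ["c", "d"]]
example_relevant = ["a", "a", "x"]
example_mrr = mean_reciprocal_rank(example_rankings, example_relevant)
example_recall1 = recall_at_k(example_rankings, example_relevant, k=1)
```

In this toy example the relevant chunk is ranked first, second, and not retrieved at all, so the reciprocal ranks are 1, 0.5, and 0.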

Abstract

Tabular documents such as CSV and Excel files are widely used in enterprise data pipelines, yet existing chunking strategies for retrieval-augmented generation (RAG) are primarily designed for unstructured text and do not account for tabular structure. We propose a structure-aware tabular chunking (STC) framework that operates on row-level units by constructing a hierarchical Row Tree representation, where each row is encoded as a key-value block. STC performs token-constrained splitting aligned with structural boundaries and applies overlap-free greedy merging to produce dense, non-overlapping chunks. This design preserves semantic relationships between fields within a row while improving token utilization and reducing fragmentation. Across evaluations on the MAUD dataset, STC reduces chunk count by up to 40% and 56% compared to standard recursive and key-value based baselines,…
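
The pipeline described in the abstract (key-value row blocks, token-constrained splitting at structural boundaries, overlap-free greedy merging) can be illustrated with a rough sketch. This is not the authors' implementation. It makes several assumptions: whitespace word counts stand in for a real tokenizer, the Row Tree is flattened to two levels (row, then fields), and all function names are hypothetical.

```python
from typing import Dict, List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def row_to_fields(row: Dict[str, str]) -> List[str]:
    """Encode one table row as a list of 'key: value' lines."""
    return [f"{key}: {value}" for key, value in row.items()]


def split_row(fields: List[str], max_tokens: int) -> List[str]:
    """Split a row into blocks, cutting only at field boundaries.

    A single field larger than the budget is kept whole as an oversize block.
    """
    blocks, current, used = [], [], 0
    for field in fields:
        n = count_tokens(field)
        if current and used + n > max_tokens:
            blocks.append("\n".join(current))
            current, used = [], 0
        current.append(field)
        used += n
    if current:
        blocks.append("\n".join(current))
    return blocks


def greedy_merge(blocks: List[str], max_tokens: int) -> List[str]:
    """Pack adjacent blocks into chunks without overlap, up to the budget."""
    chunks, current, used = [], [], 0
    for block in blocks:
        n = count_tokens(block)
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def chunk_table(rows: List[Dict[str, str]], max_tokens: int) -> List[str]:
    """Split each row at field boundaries, then merge the blocks greedily."""
    blocks: List[str] = []
    for row in rows:
        blocks.extend(split_row(row_to_fields(row), max_tokens))
    return greedy_merge(blocks, max_tokens)


# Hypothetical two-row table; each row is 5 whitespace tokens as key-value text.
example_rows = [
    {"id": "1", "name": "Acme Corp"},
    {"id": "2", "name": "Beta LLC"},
]
example_chunks = chunk_table(example_rows, max_tokens=10)
```

With a budget of 10 tokens both rows fit in one chunk. A budget of 5 gives one chunk per row, and a budget of 3 forces each row to split between its two fields. Because merging never repeats a block, every field appears in exactly one chunk.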
