A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity
Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn

TL;DR
This paper systematically evaluates various document chunking strategies for dense retrieval across multiple domains, demonstrating that content-aware methods significantly enhance retrieval effectiveness and highlighting the importance of segmentation in retrieval-augmented systems.
Contribution
It provides the first large-scale, cross-domain benchmark of 36 document chunking methods, revealing the impact of segmentation strategies on retrieval performance and efficiency.
Findings
Content-aware chunking improves retrieval effectiveness over naive fixed-length splitting.
Paragraph Group Chunking achieved the highest overall nDCG@5 and substantially better top-rank hit rates.
Larger embedding models are more sensitive to segmentation quality.
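To make the contrast between naive and content-aware segmentation concrete, here is a minimal sketch. The summary above does not describe how Paragraph Group Chunking works, so `paragraph_group_chunks` is only an assumed illustration (greedy packing of whole paragraphs under a character budget), not the authors' method. The function names, the blank-line paragraph delimiter, and the `max_chars` parameter are all hypothetical.

```python
def fixed_length_chunks(text, size=200):
    """Naive baseline: cut every `size` characters, ignoring content boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def paragraph_group_chunks(text, max_chars=200):
    """Hypothetical content-aware chunker (an assumption, not the paper's algorithm).

    Splits on blank lines and greedily groups consecutive paragraphs so that
    each chunk stays within `max_chars` where possible. A single paragraph
    longer than the budget is kept whole rather than cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

The fixed-length splitter can cut a sentence in half, whereas the grouping variant only ever breaks between paragraphs. That difference is the kind of segmentation quality the findings refer to.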
Abstract
We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 ≈ 0.459) and substantially better top-rank hit rates…
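The three metrics named in the abstract can be sketched as follows for a single query, given the graded relevance scores of its retrieved chunks in rank order. This is a minimal illustration using textbook definitions, not the paper's evaluation code. Several choices are assumptions: the linear gain in DCG, the `threshold` at which a grade counts as a hit, and computing the ideal ranking from the retrieved list alone.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=5):
    """DCG normalised by the DCG of the same scores sorted best-first.

    Assumption: the ideal ordering is derived from the retrieved list only.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_at_k(relevances, k=5, threshold=1):
    """1.0 if any of the top-k results has a grade of at least `threshold`."""
    return 1.0 if any(rel >= threshold for rel in relevances[:k]) else 0.0


def reciprocal_rank(relevances, threshold=1):
    """1 / rank of the first relevant result; MRR is its mean over queries."""
    for rank, rel in enumerate(relevances, start=1):
        if rel >= threshold:
            return 1.0 / rank
    return 0.0
```

For example, `ndcg_at_k([3, 2, 1, 0, 0])` is 1.0 because the list is already in ideal order. Moving the only relevant chunk from rank 1 to rank 2 lowers both nDCG and the reciprocal rank, while Hit@5 is unchanged.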
Taxonomy
Topics: Information Retrieval and Search Behavior · Biomedical Text Mining and Ontologies · Topic Modeling
