Beyond Chunk-Then-Embed: A Comprehensive Taxonomy and Evaluation of Document Chunking Strategies for Information Retrieval
Yongjie Zhou, Shuai Wang, Bevan Koopman, Guido Zuccon

TL;DR
This paper systematically evaluates various document chunking strategies for dense retrieval, revealing that optimal methods depend on the specific retrieval task and providing a unified framework for comparison.
Contribution
It introduces a comprehensive taxonomy and evaluation framework for document chunking strategies, unifying diverse approaches and analyzing their effectiveness across different retrieval settings.
Findings
Simple structure-based chunking outperforms LLM-guided methods in in-corpus retrieval.
LumberChunker excels in in-document retrieval tasks.
Contextualized chunking benefits in-corpus but not in-document retrieval.
Abstract
Document chunking is a critical preprocessing step in dense retrieval systems, yet the design space of chunking strategies remains poorly understood. Recent research has proposed several concurrent approaches, including LLM-guided methods (e.g., DenseX and LumberChunker) and contextualized strategies (e.g., Late Chunking), which generate embeddings before segmentation to preserve contextual information. However, these methods emerged independently and were evaluated on benchmarks with minimal overlap, making direct comparisons difficult. This paper reproduces prior studies in document chunking and presents a systematic framework that unifies existing strategies along two key dimensions: (1) segmentation methods, including structure-based methods (fixed-size, sentence-based, and paragraph-based) as well as semantically-informed and LLM-guided methods; and (2) embedding paradigms, which…
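To make the structure-based segmentation family named in the abstract concrete, here is a minimal Python sketch of fixed-size, sentence-based, and paragraph-based chunking. It is illustrative only and is not the paper's implementation: the function names, the whitespace tokenization, the regex sentence splitter, and the default sizes are all assumptions chosen for brevity.

```python
import re


def fixed_size_chunks(text, size=50, overlap=0):
    """Fixed-size chunking over whitespace tokens (assumed tokenization).

    Consecutive windows share `overlap` tokens; the final window may be shorter.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        # Stop once a window reaches the end, so no redundant tail chunk is emitted.
        if start + size >= len(tokens):
            break
    return chunks


def sentence_chunks(text):
    """Sentence-based chunking with a naive splitter (assumption):
    break on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def paragraph_chunks(text):
    """Paragraph-based chunking: split on one or more blank lines."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    doc = "One two three. Four five!\n\nSix seven eight nine."
    print(fixed_size_chunks(doc, size=4))
    print(sentence_chunks(doc))
    print(paragraph_chunks(doc))
```

In a chunk-then-embed pipeline, each returned string would be passed to the encoder on its own. A production system would likely swap the regex for a proper sentence tokenizer and count tokens with the encoder's own tokenizer.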
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Biomedical Text Mining and Ontologies
