Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis
Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, Nicolas Flores-Herr

TL;DR
This paper systematically evaluates how different chunk sizes affect retrieval performance in long-document IR across multiple datasets and embedding models, highlighting the importance of dataset-specific chunking strategies.
Contribution
It provides a comprehensive analysis of fixed-size chunking strategies and their impact on retrieval effectiveness across diverse datasets and embedding models.
Findings
Smaller chunks (64-128 tokens) are optimal for fact-based datasets.
Larger chunks (512-1024 tokens) improve retrieval for context-rich datasets.
Embedding models differ in their sensitivity to chunk size; Stella, for example, benefits from larger chunks.
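To make the chunk sizes above concrete, here is a minimal sketch of fixed-size chunking. It is an illustration only, not the authors' code: the function name `chunk_fixed_size` and the `overlap` parameter are hypothetical, and whitespace splitting stands in for a real tokenizer, so the token counts will not match those of any particular embedding model.

```python
from typing import List


def chunk_fixed_size(tokens: List[str], chunk_size: int, overlap: int = 0) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `chunk_size` tokens.

    `overlap` (hypothetical, not taken from the paper) is the number of tokens
    shared between neighbouring chunks; 0 gives disjoint chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end of the document, so that
        # overlapping windows do not emit a redundant trailing chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    # Assumption: whitespace tokens as a stand-in for model tokens.
    document = " ".join(f"word{i}" for i in range(300))
    tokens = document.split()
    for size in (64, 128, 512, 1024):
        chunks = chunk_fixed_size(tokens, chunk_size=size)
        print(size, len(chunks), len(chunks[-1]))
```

In a retrieval pipeline each chunk would then be embedded and indexed separately, so chunk size controls both how many vectors a document produces and how much context each vector has to represent.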
Abstract
Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and their influence on retrieval performance using multiple embedding models. Our experiments, conducted on both short-form and long-form datasets, reveal that chunk size plays a critical role in retrieval effectiveness -- smaller chunks (64-128 tokens) are optimal for datasets with concise, fact-based answers, whereas larger chunks (512-1024 tokens) improve retrieval in datasets requiring broader contextual understanding. We also analyze the impact of chunking on different embedding models, finding that they exhibit distinct chunking sensitivities. While models like Stella benefit from larger chunks, leveraging global context for long-range retrieval,…
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems
