Rethinking Chunk Size For Long-Document Retrieval: A Multi-Dataset Analysis

Sinchana Ramakanth Bhat; Max Rudat; Jannis Spiekermann; Nicolas Flores-Herr

arXiv:2505.21700·cs.IR·May 30, 2025

TL;DR

This paper systematically evaluates how different chunk sizes affect retrieval performance in long-document IR across multiple datasets and embedding models, highlighting the importance of dataset-specific chunking strategies.

Contribution

It provides a comprehensive analysis of fixed-size chunking strategies and their impact on retrieval effectiveness across diverse datasets and embedding models.

Findings

01. Smaller chunks (64-128 tokens) are optimal for fact-based datasets.

02. Larger chunks (512-1024 tokens) improve retrieval for context-rich datasets.

03. Embedding models exhibit different sensitivities to chunk size.

Abstract

Chunking is a crucial preprocessing step in retrieval-augmented generation (RAG) systems, significantly impacting retrieval effectiveness across diverse datasets. In this study, we systematically evaluate fixed-size chunking strategies and their influence on retrieval performance using multiple embedding models. Our experiments, conducted on both short-form and long-form datasets, reveal that chunk size plays a critical role in retrieval effectiveness -- smaller chunks (64-128 tokens) are optimal for datasets with concise, fact-based answers, whereas larger chunks (512-1024 tokens) improve retrieval in datasets requiring broader contextual understanding. We also analyze the impact of chunking on different embedding models, finding that they exhibit distinct chunking sensitivities. While models like Stella benefit from larger chunks, leveraging global context for long-range retrieval,…
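To make "fixed-size chunking" concrete, here is a minimal Python sketch, not the authors' implementation. It assumes whitespace-separated words stand in for tokens (the paper's tokenizer is not stated on this page), and the function name `chunk_fixed_size` and the optional `overlap` parameter are hypothetical, introduced only for illustration. The chunk sizes in the loop are the ones named in the abstract.

```python
from typing import List


def chunk_fixed_size(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `chunk_size` tokens.

    Assumption: tokens are whitespace-separated words. A real pipeline
    would more likely count tokens with the embedding model's tokenizer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once the current window has reached the end of the document,
        # so an overlapping tail is not emitted as a redundant extra chunk.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# A synthetic 300-token document, chunked at the sizes named in the abstract.
document = " ".join(f"tok{i}" for i in range(300))
for size in (64, 128, 512, 1024):
    print(size, len(chunk_fixed_size(document, size)))
```

With this toy document, the smaller sizes yield several chunks while 512 and 1024 each yield a single chunk, which illustrates the trade-off the abstract describes: small chunks isolate individual facts, large chunks keep more surrounding context together.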

Code & Models

Repositories

fraunhofer-iais/chunking-strategies (Official)

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Expert finding and Q&A systems