Chunk Twice, Embed Once: A Systematic Study of Segmentation and Representation Trade-offs in Chemistry-Aware Retrieval-Augmented Generation
Mahmoud Amiri, Thomas Bocklitz

TL;DR
This paper systematically evaluates document segmentation and embedding strategies for chemistry-focused Retrieval-Augmented Generation systems, revealing optimal configurations and providing benchmarks to guide future development.
Contribution
It offers the first large-scale analysis of segmentation and embedding choices in chemistry RAG, introducing new datasets and evaluation tools for domain-specific retrieval.
Findings
Recursive token-based chunking (R100-0) performs best.
Retrieval-optimized embeddings outperform domain-specific models.
New benchmark datasets and evaluation framework released.
Abstract
Retrieval-Augmented Generation (RAG) systems are increasingly vital for navigating the ever-expanding body of scientific literature, particularly in high-stakes domains such as chemistry. Despite the promise of RAG, foundational design choices -- such as how documents are segmented and represented -- remain underexplored in domain-specific contexts. This study presents the first large-scale, systematic evaluation of chunking strategies and embedding models tailored to chemistry-focused RAG systems. We investigate 25 chunking configurations across five method families and evaluate 48 embedding models on three chemistry-specific benchmarks, including the newly introduced QuestChemRetrieval dataset. Our results reveal that recursive token-based chunking (specifically R100-0) consistently outperforms other approaches, offering strong performance with minimal resource overhead. We also find…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsLinear Warmup With Linear Decay · Refunds@Expedia|||How do I get a full refund from Expedia? · Attention Dropout · Byte Pair Encoding · Dense Connections · Softmax · Layer Normalization · Dropout · BERT · BART
