Breaking It Down: Domain-Aware Semantic Segmentation for Retrieval Augmented Generation
Aparajitha Allamraju, Maitreya Prafulla Chitale, Hiranmai Sri Adibhatla, Rahul Mishra, Manish Shrivastava

TL;DR
This paper presents two novel semantic chunking methods, PSC and MFC, that significantly improve retrieval and generation quality in domain-specific document processing, with strong out-of-domain generalization.
Contribution
Introduction of two efficient semantic chunking methods, PSC and MFC, trained on PubMed data, with an evaluation framework demonstrating their impact on retrieval and generation.
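For readers new to the idea, the sketch below shows a generic form of semantic chunking: start a new chunk wherever the cosine similarity between adjacent sentence embeddings drops below a threshold. It is not PSC or MFC, whose details this summary does not give. The function names, the `threshold` value, and the toy two-dimensional embeddings are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def chunk_by_similarity(
    sentences: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[List[str]]:
    """Group consecutive sentences, splitting where adjacent similarity is low."""
    if len(sentences) != len(embeddings):
        raise ValueError("sentences and embeddings must have the same length")
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append([sentences[i]])  # similarity dropped: open a new chunk
        else:
            chunks[-1].append(sentences[i])  # still on topic: extend current chunk
    return chunks


# Toy embeddings (assumed): two sentences per topic, topics nearly orthogonal.
toy_sentences = ["s1", "s2", "s3", "s4"]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_chunks = chunk_by_similarity(toy_sentences, toy_embeddings)
```

In practice the embeddings would come from a sentence-embedding model, and the split criterion is where methods differ; a fixed-length splitter would ignore the embeddings entirely.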
Findings
Substantial retrieval improvements on PubMedQA: a 24x gain in MRR with PSC, along with higher Hits@k.
PSC and MFC generalize well across multiple datasets.
PSC consistently delivers superior performance.
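The two retrieval metrics named above have standard definitions, sketched here for reference. The sketch assumes each query has exactly one relevant chunk, which may not match the paper's evaluation setup; the function names and the toy data are illustrative only.

```python
from typing import Hashable, List, Sequence, Tuple

# One run = (ranked chunk ids returned by the retriever, id of the relevant chunk).
Run = Tuple[Sequence[Hashable], Hashable]


def reciprocal_rank(ranked_ids: Sequence[Hashable], relevant_id: Hashable) -> float:
    """1 / rank of the relevant item (1-based), or 0.0 if it was not retrieved."""
    for rank, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Run]) -> float:
    """Average reciprocal rank over all queries."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


def hits_at_k(runs: List[Run], k: int) -> float:
    """Fraction of queries whose relevant item appears in the top k results."""
    if not runs:
        return 0.0
    return sum(1 for ranked, gold in runs if gold in ranked[:k]) / len(runs)


# Toy data (assumed): relevant chunk ranked first, ranked third, and missing.
toy_runs = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "c"),
    (["a", "b", "c"], "z"),
]
toy_mrr = mean_reciprocal_rank(toy_runs)
toy_hits_at_1 = hits_at_k(toy_runs, 1)
toy_hits_at_3 = hits_at_k(toy_runs, 3)
```

Because MRR rewards placing the relevant chunk near the top, a large multiplicative gain in MRR can coexist with a smaller change in Hits@k at large k.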
Abstract
Document chunking is a crucial component of Retrieval-Augmented Generation (RAG), as it directly affects the retrieval of relevant and precise context. Conventional fixed-length and recursive splitters often produce arbitrary, incoherent segments that fail to preserve semantic structure. Although semantic chunking has gained traction, its influence on generation quality remains underexplored. This paper introduces two efficient semantic chunking methods, Projected Similarity Chunking (PSC) and Metric Fusion Chunking (MFC), trained on PubMed data using three different embedding models. We further present an evaluation framework that measures the effect of chunking on both retrieval and generation by augmenting PubMedQA with full-text PubMed Central articles. Our results show substantial retrieval improvements (24x with PSC) in MRR and higher Hits@k on PubMedQA. We provide a comprehensive…
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Biomedical Text Mining and Ontologies
