Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen

TL;DR
This paper introduces a hierarchical text segmentation approach to chunking in Retrieval-Augmented Generation (RAG) systems, producing more semantically coherent chunks and improving retrieval on the NarrativeQA, QuALITY, and QASPER datasets.
Contribution
It presents a novel framework integrating hierarchical segmentation and clustering for more semantically coherent chunks in RAG systems.
Findings
Improved retrieval accuracy on NarrativeQA, QuALITY, and QASPER datasets.
Enhanced semantic coherence of chunks compared to traditional methods.
Better relevance and context-awareness in generated responses.
Abstract
Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by giving them access to external knowledge, ensuring that the retrieved information is up-to-date and domain-specific, and they commonly rely on chunking strategies for retrieval. However, traditional chunking methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate…
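To make the two-level retrieval idea in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names (`build_index`, `retrieve`), the use of cosine similarity, and the choice to represent a cluster by the mean of its segment vectors are all assumptions made for illustration, and the vectors are assumed to come from some embedding model that is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def build_index(segment_vectors, clusters):
    """Build one flat index holding both segment-level and cluster-level entries.

    segment_vectors: dict mapping segment id -> embedding (list of floats)
    clusters: dict mapping cluster id -> list of member segment ids
    Assumption: a cluster's vector is the mean of its members' vectors.
    """
    index = []
    for seg_id, vec in segment_vectors.items():
        index.append(("segment", seg_id, [seg_id], vec))
    for cluster_id, members in clusters.items():
        vec = mean_vector([segment_vectors[m] for m in members])
        index.append(("cluster", cluster_id, list(members), vec))
    return index


def retrieve(query_vec, index, top_k=2):
    """Return the top_k entries as (score, level, unit_id, member_segment_ids).

    Segments and clusters compete in the same ranking, so a query can match
    either a single segment or a group of related segments.
    """
    scored = [
        (cosine(query_vec, vec), level, unit_id, members)
        for level, unit_id, members, vec in index
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

Ranking both levels in one list is only one possible design. A real system might instead retrieve at each level separately and merge the results, or deduplicate segments that also appear inside a retrieved cluster.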
