Tree-Based Text Retrieval via Hierarchical Clustering in RAGFrameworks: Application on Taiwanese Regulations
Chia-Heng Yu, Yen-Lung Tsai

TL;DR
This paper introduces a hierarchical clustering-based retrieval method for RAG systems that adaptively selects relevant documents without predefining k, improving accuracy and interpretability in legal text retrieval.
Contribution
The proposed method eliminates the need to predefine k in RAG retrieval, enhancing relevance and interpretability in legal document retrieval tasks.
Findings
Achieves superior performance in expert evaluations
Maintains high precision without predefining k
Improves accuracy and interpretability in legal text retrieval
Abstract
Traditional Retrieval-Augmented Generation (RAG) systems employ brute-force inner product search to retrieve the top-k most similar documents, then combined with the user query and passed to a language model. This allows the model to access external knowledge and reduce hallucinations. However, selecting an appropriate k value remains a significant challenge in practical applications: a small k may fail to retrieve sufficient information, while a large k can introduce excessive and irrelevant content. To address this, we propose a hierarchical clustering-based retrieval method that eliminates the need to predefine k. Our approach maintains the accuracy and relevance of system responses while adaptively selecting semantically relevant content. In the experiment stage, we applied our method to a Taiwanese legal dataset with expert-graded queries. The results show that our approach…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · Attention Dropout · Dropout · Byte Pair Encoding · Softmax · Dense Connections · Layer Normalization · Linear Warmup With Linear Decay · BERT · BART
