Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

Shuo Yu (1), Mingyue Cheng (1), Qi Liu (1), Daoyu Wang (1), Jiqian Yang (1), Jie Ouyang (1), Yucong Luo (1), Chenyi Lei (2), Enhong Chen (1) ((1) State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China, Hefei, China; (2) Kuaishou Technology, Beijing, China)

arXiv:2409.13694 · cs.CL · October 10, 2025 · 2 cites

TL;DR

This paper introduces a benchmark dataset and a pruning-based framework for retrieval-augmented generation that effectively integrates multiple knowledge sources, reducing hallucinations and improving performance.

Contribution

It provides the first standardized multi-source knowledge dataset and a novel PruningRAG framework with multi-granularity pruning strategies for better knowledge integration in RAG.

Findings

1. PruningRAG improves performance across various RAG models.
2. The dataset enables comprehensive evaluation of multi-source knowledge integration.
3. Pruning strategies effectively reduce misleading information.

Abstract

Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach to mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. Despite numerous efforts, most studies focus on a single type of external knowledge source. In real-world applications, however, knowledge typically comes from diverse sources, and this setting remains underexplored. The main obstacle is the lack of a suitable dataset containing multiple knowledge sources and of preliminary exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies to…

Taxonomy

Topics: Semantic Web and Ontologies · Information Retrieval and Search Behavior · Recommender Systems and Techniques

Methods: Attention Is All You Need · Pruning · Focus · Sparse Evolutionary Training · Linear Layer · Attention Dropout · Dense Connections · Multi-Head Attention · Linear Warmup With Linear Decay