Benchmarking Large Language Models in Retrieval-Augmented Generation
Jiawei Chen, Hongyu Lin, Xianpei Han, Le Sun

TL;DR
This paper systematically evaluates how large language models perform in retrieval-augmented generation tasks, identifying key challenges and establishing a new benchmark to guide future improvements in LLM capabilities.
Contribution
It introduces the Retrieval-Augmented Generation Benchmark (RGB) for comprehensive evaluation of LLMs across fundamental abilities relevant to RAG.
Findings
LLMs show moderate noise robustness
LLMs struggle with negative rejection and information integration
Significant challenges remain for effective RAG application
Abstract
Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate…
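To make the four abilities named in the abstract concrete, the following is a minimal sketch of how a noise-robustness test context might be assembled and scored. It is not the RGB implementation: the `Ability` enum, the `build_context` and `accuracy` helpers, the `noise_ratio` parameter, and the substring-match scoring rule are all assumptions introduced here for illustration.

```python
import random
from enum import Enum


class Ability(Enum):
    """The four abilities listed in the abstract (enum itself is hypothetical)."""
    NOISE_ROBUSTNESS = "noise robustness"
    NEGATIVE_REJECTION = "negative rejection"
    INFORMATION_INTEGRATION = "information integration"
    COUNTERFACTUAL_ROBUSTNESS = "counterfactual robustness"


def build_context(relevant, noisy, noise_ratio, total, seed=0):
    """Mix relevant and noisy documents at a given noise ratio.

    Hypothetical helper: `noise_ratio` is the assumed fraction of the
    `total` retrieved documents that do not contain the answer.
    """
    rng = random.Random(seed)
    n_noisy = round(total * noise_ratio)
    n_relevant = total - n_noisy
    docs = rng.sample(relevant, n_relevant) + rng.sample(noisy, n_noisy)
    rng.shuffle(docs)  # avoid positional cues about which documents are relevant
    return docs


def accuracy(predictions, answers):
    """Fraction of predictions that contain the reference answer.

    Assumed metric: case-insensitive substring match.
    """
    hits = sum(a.lower() in p.lower() for p, a in zip(predictions, answers))
    return hits / len(answers)


if __name__ == "__main__":
    relevant = ["rel-1", "rel-2", "rel-3", "rel-4"]
    noisy = ["noise-1", "noise-2", "noise-3"]
    context = build_context(relevant, noisy, noise_ratio=0.4, total=5)
    print(Ability.NOISE_ROBUSTNESS.value, context)
    print(accuracy(["Paris is the capital.", "I do not know."], ["paris", "Berlin"]))
```

Under these assumptions, sweeping `noise_ratio` upward and re-scoring would show how quickly a model's accuracy degrades as irrelevant documents crowd out relevant ones.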
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Adam · Weight Decay
