Benchmarking Large Language Models in Retrieval-Augmented Generation

Jiawei Chen; Hongyu Lin; Xianpei Han; Le Sun

arXiv:2309.01431 · cs.CL · December 21, 2023 · 53 cites



TL;DR

This paper systematically evaluates how large language models perform in retrieval-augmented generation tasks, identifying key challenges and establishing a new benchmark to guide future improvements in LLM capabilities.

Contribution

It introduces the Retrieval-Augmented Generation Benchmark (RGB) for comprehensive evaluation of LLMs across fundamental abilities relevant to RAG.

Findings

01 LLMs show moderate noise robustness

02 LLMs struggle with negative rejection and information integration

03 Significant challenges remain for effective RAG application

Abstract

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG: noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate…
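To make the noise robustness and negative rejection abilities more concrete, here is a minimal Python sketch of how a test context might be assembled and scored. It is an illustration under stated assumptions, not the RGB implementation: the `Instance` fields, the `build_context` and `contains_answer` helpers, and the substring-match scoring rule are all hypothetical names and choices introduced here.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """One hypothetical benchmark item (field names are assumptions)."""
    question: str
    answers: List[str]        # acceptable answer strings
    positive_docs: List[str]  # retrieved documents that contain an answer
    negative_docs: List[str]  # related documents that do not contain an answer


def build_context(inst: Instance, num_docs: int, noise_ratio: float,
                  rng: random.Random) -> List[str]:
    """Mix answer-bearing and noisy documents at the requested ratio."""
    num_noise = round(num_docs * noise_ratio)
    num_pos = num_docs - num_noise
    # Clamp to what is available so sampling never exceeds the pool size.
    pos = rng.sample(inst.positive_docs, min(num_pos, len(inst.positive_docs)))
    neg = rng.sample(inst.negative_docs, min(num_noise, len(inst.negative_docs)))
    docs = pos + neg
    rng.shuffle(docs)  # avoid a positional cue about which documents are noise
    return docs


def contains_answer(response: str, answers: List[str]) -> bool:
    """Assumed scoring rule: correct if any answer string appears in the response."""
    text = response.lower()
    return any(a.lower() in text for a in answers)


def accuracy(responses: List[str], instances: List[Instance]) -> float:
    """Fraction of responses that contain an acceptable answer."""
    if not instances:
        return 0.0
    hits = sum(contains_answer(r, i.answers) for r, i in zip(responses, instances))
    return hits / len(instances)
```

In this sketch, raising `noise_ratio` probes noise robustness, while `noise_ratio=1.0` yields a context with no answer-bearing document, which is the situation negative rejection is concerned with. Scoring a rejection would need a different check (for example, detecting a refusal phrase), which is left out here.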


Code & Models

Repositories

chen700564/RGB
Official

Videos

Benchmarking Large Language Models in Retrieval-Augmented Generation· underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Layer Normalization · Linear Layer · Dense Connections · Attention Dropout · Residual Connection · Adam · Weight Decay