Benchmarking Retrieval-Augmented Generation for Medicine

Guangzhi Xiong; Qiao Jin; Zhiyong Lu; Aidong Zhang

arXiv:2402.13178·cs.CL·February 26, 2024·23 cites



TL;DR

This paper introduces MIRAGE, a benchmark for evaluating retrieval-augmented generation (RAG) systems in medicine. It shows that MedRAG improves LLM accuracy by up to 18% and identifies log-linear scaling and 'lost-in-the-middle' effects in medical RAG.

Contribution

It presents MIRAGE, the first extensive benchmark for medical RAG systems, and provides large-scale experimental insights and best practices for medical question answering.

Findings

01. MedRAG improves the accuracy of LLMs by up to 18%.

02. Combining various medical corpora and retrievers yields the best performance.

03. The study identifies log-linear scaling and 'lost-in-the-middle' effects in medical RAG.

Abstract

While large language models (LLMs) have achieved state-of-the-art performance on a wide range of medical question answering (QA) tasks, they still face challenges with hallucinations and outdated knowledge. Retrieval-augmented generation (RAG) is a promising solution and has been widely adopted. However, a RAG system can involve multiple flexible components, and there is a lack of best practices regarding the optimal RAG setting for various medical purposes. To systematically evaluate such systems, we propose the Medical Information Retrieval-Augmented Generation Evaluation (MIRAGE), a first-of-its-kind benchmark including 7,663 questions from five medical QA datasets. Using MIRAGE, we conducted large-scale experiments with over 1.8 trillion prompt tokens on 41 combinations of different corpora, retrievers, and backbone LLMs through the MedRAG toolkit introduced in this work. Overall,…
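The abstract notes that a RAG system combines several interchangeable components (a corpus, a retriever, and a backbone LLM). A minimal sketch of that pipeline shape follows. It is not the MedRAG toolkit's API: the toy corpus, the word-overlap retriever, and the names `retrieve` and `build_prompt` are all hypothetical stand-ins chosen for illustration, and the LLM call itself is left out.

```python
from collections import Counter

# Hypothetical toy corpus; these snippets are illustrative only and are
# not drawn from MIRAGE or from any MedRAG corpus.
CORPUS = [
    "Metformin is a first-line medication for type 2 diabetes.",
    "Amoxicillin is a penicillin-class antibiotic.",
    "Warfarin is an anticoagulant that requires INR monitoring.",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip trailing punctuation."""
    return [t.strip(".,?!").lower() for t in text.split()]


def retrieve(question, corpus, k=2):
    """Rank snippets by word overlap with the question.

    This is a stand-in for a real retriever (lexical or dense); it is
    an assumption of this sketch, not the method used in the paper.
    """
    q = Counter(tokenize(question))
    scored = []
    for snippet in corpus:
        overlap = sum((q & Counter(tokenize(snippet))).values())
        scored.append((overlap, snippet))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]


def build_prompt(question, snippets):
    """Place the retrieved snippets ahead of the question for the LLM."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


question = "Which medication is first-line for type 2 diabetes?"
top = retrieve(question, CORPUS, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping the corpus, the retriever, or the model that consumes `prompt` changes one component at a time, which is the kind of combination a benchmark like MIRAGE is designed to compare.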

Code & Models

Models

HiTZ/Mistral-7B-MedExpQA-EN (Hugging Face model, 5 downloads)


Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Biomedical Text Mining and Ontologies

Methods: Linear Warmup With Linear Decay · Linear Layer · WordPiece · Byte Pair Encoding · Attention Dropout · Dense Connections · Cosine Annealing