"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation

Nandan Thakur; Luiz Bonifacio; Xinyu Zhang; Odunayo Ogundepo; Ehsan Kamalloo; David Alfonso-Hermelo; Xiaoguang Li; Qun Liu; Boxing Chen; Mehdi Rezagholizadeh; Jimmy Lin

arXiv:2312.11361 · cs.CL · November 12, 2024 · 1 citation

TL;DR

This paper introduces NoMIRACL, a multilingual dataset for evaluating the robustness of Large Language Models in Retrieval-Augmented Generation across diverse languages, focusing on hallucination and error rates.

Contribution

We created and analyzed NoMIRACL, a human-annotated dataset for assessing LLM robustness in RAG across 18 languages, addressing the lack of multilingual evaluation benchmarks.

Findings

01 Most models hallucinate at high rates on the non-relevant subset, where no retrieved passage contains the answer.

02 Mistral and LLAMA-3 hallucinate less but have higher error rates.

03 GPT-4 offers the best balance between hallucination rate and error rate.

Abstract

Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least a single judged relevant passage. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error…
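The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes each model response has already been reduced to a binary decision: either the model claims the passages contain an answer, or it abstains. The hallucination rate follows the abstract's description (answering on the non-relevant subset). The abstract is cut off before it defines the error rate, so the version here (abstaining on the relevant subset) is an assumption based on the symmetric case. The `Judgment` class, its field names, and both function names are hypothetical and are not taken from the NoMIRACL code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    """One model decision for one query (hypothetical structure)."""
    subset: str                   # "non_relevant" or "relevant"
    model_says_answerable: bool   # True if the model claims an answer is present


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; rejects an empty subset instead of guessing."""
    if not flags:
        raise ValueError("subset is empty; rate is undefined")
    return sum(flags) / len(flags)


def hallucination_rate(judgments: List[Judgment]) -> float:
    """Share of non-relevant-subset queries where the model still answers."""
    return _rate([j.model_says_answerable
                  for j in judgments if j.subset == "non_relevant"])


def error_rate(judgments: List[Judgment]) -> float:
    """ASSUMPTION: share of relevant-subset queries where the model abstains."""
    return _rate([not j.model_says_answerable
                  for j in judgments if j.subset == "relevant"])


# Toy data for illustration only; these are not results from the paper.
sample = [
    Judgment("non_relevant", True),    # hallucination: answers without support
    Judgment("non_relevant", False),
    Judgment("non_relevant", False),
    Judgment("non_relevant", False),
    Judgment("relevant", True),
    Judgment("relevant", False),       # error: misses a relevant passage
]

print(hallucination_rate(sample))  # 0.25
print(error_rate(sample))          # 0.5
```

Under these assumptions the two rates pull in opposite directions: a model that always abstains scores 0 on hallucination but 1 on error, which is why the findings describe a balance between the two.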

Code & Models

Repositories

project-miracl/nomiracl
Official

Datasets

amitbcp/nomir
dataset · 91 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Domain Adaptation and Few-Shot Learning

Methods: Flan-T5 · Multi-Head Attention · Attention Is All You Need · WordPiece · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam