Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards

Manveer Singh Tamber; Forrest Sheng Bao; Chenyu Xu; Ge Luo; Suleman Kazi; Minseok Bae; Miaoran Li; Ofer Mendelevitch; Renyi Qu; Jimmy Lin

arXiv:2505.04847·cs.CL·November 7, 2025

TL;DR

This paper introduces FaithJudge, an LLM-as-a-judge evaluation framework, and an enhanced hallucination leaderboard built around it, to benchmark LLM faithfulness in retrieval-augmented generation (RAG) more reliably and so support more trustworthy systems.

Contribution

The paper presents FaithJudge, a novel LLM-as-a-judge framework, and an improved hallucination leaderboard for more reliable benchmarking of LLM hallucinations in RAG tasks.

Findings

01. FaithJudge significantly improves hallucination detection accuracy.

02. Enhanced leaderboard enables more reliable LLM benchmarking.

03. Framework supports development of trustworthy AI systems.

Abstract

Retrieval-augmented generation (RAG) aims to reduce hallucinations by grounding responses in external context, yet large language models (LLMs) still frequently introduce unsupported information or contradictions even when provided with relevant context. This paper presents two complementary efforts at Vectara to measure and benchmark LLM faithfulness in RAG. First, we describe our original hallucination leaderboard, which has tracked hallucination rates for LLMs since 2023 using our HHEM hallucination detection model. Motivated by limitations observed in current hallucination detection methods, we introduce FaithJudge, an LLM-as-a-judge framework that leverages a pool of diverse human-annotated hallucination examples to substantially improve the automated hallucination evaluation of LLMs. We introduce an enhanced hallucination leaderboard centered on FaithJudge that benchmarks LLMs on…
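
As a rough illustration of the idea described in the abstract, the sketch below shows one way an LLM-as-a-judge check could be guided by human-annotated examples and how a hallucination rate could be computed from the resulting verdicts. It is not the FaithJudge implementation: the prompt wording, the `AnnotatedExample` type, and the functions `build_judge_prompt`, `parse_verdict`, and `hallucination_rate` are all hypothetical names chosen for this example, and the call to an actual judge model is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AnnotatedExample:
    """A human-labelled response (hypothetical structure, for illustration only)."""
    response: str
    hallucinated: bool
    note: str = ""


def build_judge_prompt(source: str, candidate: str,
                       examples: list[AnnotatedExample]) -> str:
    """Assemble a judge prompt that lists annotated examples before the candidate.

    Assumption: annotated examples are simply shown to the judge as guidance;
    the real framework may select and present them differently.
    """
    lines = [
        "Decide whether the response is fully supported by the source.",
        "",
        "Source:",
        source,
        "",
        "Human-annotated example responses:",
    ]
    for i, ex in enumerate(examples, start=1):
        label = "hallucinated" if ex.hallucinated else "faithful"
        lines.append(f"{i}. [{label}] {ex.response}")
        if ex.note:
            lines.append(f"   Annotator note: {ex.note}")
    lines += [
        "",
        "Response to evaluate:",
        candidate,
        "",
        "Answer with exactly one word: faithful or hallucinated.",
    ]
    return "\n".join(lines)


def parse_verdict(judge_output: str) -> bool:
    """Return True if the judge's text output starts with 'hallucinated'."""
    return judge_output.strip().lower().startswith("hallucinated")


def hallucination_rate(verdicts: list[bool]) -> float:
    """Fraction of responses judged hallucinated (True means hallucinated)."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)
```

In use, the string returned by `build_judge_prompt` would be sent to whichever judge model is chosen, each reply passed through `parse_verdict`, and the collected booleans summarized with `hallucination_rate` to give one leaderboard-style number per evaluated model.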

Code & Models

Repositories

vectara/FaithJudge (Official)

Videos

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards · underline

Taxonomy

Topics: Topic Modeling · Mental Health via Writing · Misinformation and Its Impacts

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Byte Pair Encoding · Attention Dropout · Softmax · Residual Connection · WordPiece