FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"

Yifei Ming; Senthil Purushwalkam; Shrey Pandit; Zixuan Ke; Xuan-Phi Nguyen; Caiming Xiong; Shafiq Joty

arXiv:2410.03727 · cs.CL · April 28, 2025 · 3 cites

TL;DR

FaithEval is a benchmark for evaluating how faithfully large language models stick to the provided context when that context is unanswerable, inconsistent, or counterfactual. It shows that even state-of-the-art models often fail to stay faithful.

Contribution

This work presents FaithEval, a new benchmark with 4.9K problems for assessing LLM faithfulness across diverse tasks, validated through rigorous multi-stage processes.

Findings

01. State-of-the-art models often fail to stay faithful to context.

02. Larger models do not necessarily perform better in faithfulness.

03. The benchmark highlights significant challenges in current LLMs' contextual understanding.

Abstract

Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and…
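To make the three task types concrete, here is a minimal Python sketch of how a context-faithfulness check could be organized. Everything in it is an illustrative assumption rather than the benchmark's actual format or scoring protocol: the `Task`, `Item`, `is_faithful`, and `accuracy` names are hypothetical, the "unknown" and "conflict" target phrases are assumed stand-ins for a faithful reply on unanswerable and inconsistent contexts, and the case-insensitive substring match is a deliberately crude scoring rule.

```python
from dataclasses import dataclass
from enum import Enum


class Task(Enum):
    """The three context types named in the abstract."""
    UNANSWERABLE = "unanswerable"      # context lacks the answer
    INCONSISTENT = "inconsistent"      # context contradicts itself
    COUNTERFACTUAL = "counterfactual"  # context contradicts world knowledge


@dataclass
class Item:
    """Hypothetical benchmark item; field names are assumptions."""
    task: Task
    context: str
    question: str
    expected: str  # phrase a context-faithful answer should contain (assumed)


def is_faithful(item: Item, response: str) -> bool:
    """Crude check: does the response contain the expected phrase?"""
    return item.expected.lower() in response.lower()


def accuracy(items: list[Item], responses: list[str]) -> float:
    """Fraction of responses judged faithful under the check above."""
    if len(items) != len(responses):
        raise ValueError("items and responses must have the same length")
    if not items:
        return 0.0
    hits = sum(is_faithful(i, r) for i, r in zip(items, responses))
    return hits / len(items)


# Toy items invented for illustration; they are not drawn from the dataset.
ITEMS = [
    Item(Task.UNANSWERABLE,
         "The report lists the city's parks but gives no population figures.",
         "What is the city's population?",
         "unknown"),
    Item(Task.INCONSISTENT,
         "Document A says the bridge opened in 1990. Document B says 2005.",
         "When did the bridge open?",
         "conflict"),
    Item(Task.COUNTERFACTUAL,
         "The Moon is made of marshmallows.",
         "According to the context, what is the Moon made of?",
         "marshmallows"),
]
```

Under these assumptions, a model that answers the counterfactual item from its own world knowledge (for example, "rock") is scored as unfaithful, while one that follows the context ("marshmallows") is scored as faithful. A real evaluation would need a more robust matcher than substring containment.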

Code & Models

Repositories

salesforceairesearch/faitheval
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling