FaithEval: Can Your Language Model Stay Faithful to Context, Even If "The Moon is Made of Marshmallows"
Yifei Ming, Senthil Purushwalkam, Shrey Pandit, Zixuan Ke, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty

TL;DR
FaithEval is a benchmark that evaluates whether large language models stay faithful to the provided context when that context is unanswerable, inconsistent, or counterfactual. It reveals that even advanced models often fail to do so.
Contribution
This work presents FaithEval, a new benchmark of 4.9K problems for assessing LLM faithfulness across three tasks (unanswerable, inconsistent, and counterfactual contexts), validated through a multi-stage process.
Findings
State-of-the-art models often fail to stay faithful to context.
Larger models do not necessarily perform better in faithfulness.
The benchmark highlights significant challenges in current LLMs' contextual understanding.
Abstract
Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination, where models generate responses misaligned with the provided context, remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and…
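As a rough illustration of the three task types named in the abstract, the sketch below scores a model's responses per task. It is not the authors' evaluation code: the `Example` record, the gold labels ("unknown", "conflict"), the toy contexts, and the substring-matching rule in `is_faithful` are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Example:
    task: str      # "unanswerable", "inconsistent", or "counterfactual"
    context: str   # passage the model must stay faithful to
    question: str
    expected: str  # hypothetical gold label; the labeling scheme is assumed


def is_faithful(example: Example, response: str) -> bool:
    # Assumed scoring rule: the response is faithful if it contains the
    # expected label, compared case-insensitively.
    return example.expected.lower() in response.lower()


def accuracy_by_task(examples, responses):
    # Fraction of faithful responses within each task type.
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ex, resp in zip(examples, responses):
        totals[ex.task] += 1
        hits[ex.task] += int(is_faithful(ex, resp))
    return {task: hits[task] / totals[task] for task in totals}


# Toy data invented for illustration; not drawn from the benchmark.
examples = [
    Example("counterfactual", "The Moon is made of marshmallows.",
            "What is the Moon made of?", "marshmallows"),
    Example("unanswerable", "The report lists only the launch date.",
            "Who led the mission?", "unknown"),
    Example("inconsistent", "Doc A says 1990. Doc B says 1995.",
            "When was it founded?", "conflict"),
]
responses = [
    "The Moon is made of rock.",            # ignores the context
    "Unknown; the context does not say.",   # abstains
    "The documents conflict on the year.",  # flags the contradiction
]

scores = accuracy_by_task(examples, responses)
```

In this toy run the model is counted as unfaithful on the counterfactual item because it answers from prior knowledge rather than from the context. A real evaluation would likely need a stricter matching rule than substring search.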
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
