HalluDial: A Large-Scale Benchmark for Automatic Dialogue-Level Hallucination Evaluation
Wen Luo, Tianshu Shen, Wei Li, Guangyue Peng, Richeng Xuan, Houfeng Wang, and Xi Yang

TL;DR
HalluDial is a large-scale benchmark designed to evaluate and improve the automatic detection of hallucinations at the dialogue level in large language models, addressing previous limitations in scope and methodology.
Contribution
The paper introduces the first comprehensive dialogue-level hallucination benchmark, covering both factuality and faithfulness hallucinations, and presents HalluJudge, a specialized judge model for hallucination evaluation.
Findings
HalluJudge achieves state-of-the-art evaluation performance.
HalluDial covers over 4,000 dialogues with nearly 147,000 samples.
The benchmark reveals strengths and weaknesses of current LLM hallucination detection methods.
Abstract
Large Language Models (LLMs) have significantly advanced the field of Natural Language Processing (NLP), achieving remarkable performance across diverse tasks and enabling widespread real-world applications. However, LLMs are prone to hallucination, generating content that either conflicts with established knowledge or is unfaithful to the original sources. Existing hallucination benchmarks primarily focus on sentence- or passage-level hallucination detection, neglecting dialogue-level evaluation, hallucination localization, and rationale provision. They also predominantly target factuality hallucinations while underestimating faithfulness hallucinations, often relying on labor-intensive or non-specialized evaluators. To address these limitations, we propose HalluDial, the first comprehensive large-scale benchmark for automatic dialogue-level hallucination evaluation. HalluDial…
