HalluHard: A Hard Multi-Turn Hallucination Benchmark
Dongyang Fan, Sebastien Delsad, Nicolas Flammarion, Maksym Andriushchenko

TL;DR
HalluHard is a new multi-turn hallucination benchmark for LLMs across high-stakes domains, highlighting persistent factual errors despite retrieval-based grounding methods.
Contribution
It introduces a challenging, multi-domain hallucination benchmark with a novel evidence retrieval and evaluation pipeline for assessing factual grounding.
Findings
Hallucinations remain high (~30%) even with web search grounding.
Model capacity and turn position influence hallucination rates.
Content-grounding errors persist across models and domains.
Abstract
Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ( for…
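The judging idea described in the abstract (retrieve evidence for each cited claim, then check whether the source supports it) can be illustrated with a small sketch. This is not the authors' pipeline: the names `Claim`, `judge_claim`, `hallucination_rate`, `toy_search`, and `toy_supports` are hypothetical, the query-broadening schedule is an assumption, and the web search and support check are replaced by in-memory stand-ins so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Claim:
    text: str      # factual assertion made by the model
    citation: str  # inline citation attached to the assertion


def judge_claim(
    claim: Claim,
    search: Callable[[str], Iterable[str]],  # hypothetical: query -> source texts
    supports: Callable[[str, str], bool],    # hypothetical: (source, claim) -> verdict
) -> bool:
    """Return True if any retrieved source supports the claim."""
    # Assumed retrieval schedule: start from the citation, then broaden.
    queries = [claim.citation, f"{claim.citation} {claim.text}", claim.text]
    for query in queries:
        for source_text in search(query):
            if supports(source_text, claim.text):
                return True
    return False


def hallucination_rate(
    claims: List[Claim],
    search: Callable[[str], Iterable[str]],
    supports: Callable[[str, str], bool],
) -> float:
    """Fraction of claims for which no supporting source was found."""
    if not claims:
        return 0.0
    unsupported = sum(not judge_claim(c, search, supports) for c in claims)
    return unsupported / len(claims)


# Toy stand-ins for web search and the support judgment (illustrative only).
corpus = {
    "Smith v. Jones": "The court in Smith v. Jones held that the contract was void.",
}


def toy_search(query: str) -> List[str]:
    # Return every stored source whose key appears in the query.
    return [text for key, text in corpus.items() if key in query]


def toy_supports(source_text: str, claim_text: str) -> bool:
    # Naive check: the claim must appear verbatim in the source.
    return claim_text.lower() in source_text.lower()


claims = [
    Claim("the contract was void", "Smith v. Jones"),
    Claim("damages were tripled", "Doe v. Roe"),
]
rate = hallucination_rate(claims, toy_search, toy_supports)
```

In this toy run, one of the two claims has no supporting source, so `rate` is 0.5. A real judge would swap `toy_search` for a component that fetches and parses full-text sources, and `toy_supports` for a model-based or human support judgment.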
Taxonomy
Topics: Topic Modeling · Misinformation and Its Impacts · Text Readability and Simplification
