HalluHard: A Hard Multi-Turn Hallucination Benchmark

Dongyang Fan; Sebastien Delsad; Nicolas Flammarion; Maksym Andriushchenko

arXiv:2602.01031 · cs.AI · February 3, 2026

TL;DR

HalluHard is a multi-turn hallucination benchmark for LLMs spanning four high-stakes domains. It shows that factual errors persist even when models are grounded with web search.

Contribution

The paper introduces a challenging multi-domain hallucination benchmark, together with a judging pipeline that retrieves evidence via web search to check whether cited sources support the generated content.

Findings

01. Hallucinations remain high (~30%) even with web search grounding.

02. Model capacity and turn position influence hallucination rates.

03. Content-grounding errors persist across models and domains.

Abstract

Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search ($\approx 30\%$ for…
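To make the citation-based notion of groundedness concrete, the following is a minimal Python sketch, not the authors' pipeline. It assumes citations appear as bracketed numbers such as `[1]`, treats each sentence as one claim, and counts a claim as hallucinated when it has no citation or when a caller-supplied judge rejects it. The names `Claim`, `extract_claims`, `hallucination_rate`, and `is_supported` are hypothetical; in the paper, the judging step involves web search and full-text parsing, which is stubbed out here.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Assumed citation format: bracketed integers such as "[3]".
CITATION = re.compile(r"\s*\[(\d+)\]")


@dataclass
class Claim:
    text: str             # sentence with citation markers removed
    citations: List[int]  # citation ids found in the sentence


def extract_claims(response: str) -> List[Claim]:
    """Split a response into sentences and collect each sentence's citations."""
    sentences = re.split(r"(?<=[.!?])\s+", response.strip())
    claims = []
    for sentence in sentences:
        if not sentence:
            continue
        ids = [int(n) for n in CITATION.findall(sentence)]
        claims.append(Claim(text=CITATION.sub("", sentence).strip(), citations=ids))
    return claims


def hallucination_rate(claims: List[Claim],
                       is_supported: Callable[[Claim], bool]) -> float:
    """Fraction of claims that are uncited or rejected by the judge.

    `is_supported` is a stand-in for an evidence-checking judge; counting
    uncited sentences as unsupported is an assumption of this sketch.
    """
    if not claims:
        return 0.0
    bad = sum(1 for c in claims if not c.citations or not is_supported(c))
    return bad / len(claims)


if __name__ == "__main__":
    reply = ("The statute was amended in 2019 [1]. "
             "The court upheld it [2]. It applies nationwide.")
    claims = extract_claims(reply)
    # Toy judge: pretend only source 1 actually supports its claim.
    print(hallucination_rate(claims, lambda c: 1 in c.citations))
```

In this toy run, one claim is supported, one is cited but rejected by the judge, and one is uncited, so two of the three claims count as hallucinated.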

Taxonomy

Topics: Topic Modeling · Misinformation and Its Impacts · Text Readability and Simplification