Do Benchmarks Underestimate LLM Performance? Evaluating Hallucination Detection With LLM-First Human-Adjudicated Assessment

I. F. Atasoy; B. Mutlu; E. A. Sezer; A. Wahdan

arXiv:2605.08462 · cs.CL · May 12, 2026

TL;DR

This study finds that current benchmarks may underestimate LLM performance on hallucination detection, and proposes human-adjudicated re-evaluation of conflicted samples to make model assessments more reliable.

Contribution

It introduces a human-adjudicated re-evaluation process in which samples where benchmark annotations and LLM judgments conflict are re-examined by human adjudicators, making hallucination detection evaluations more reliable.

Findings

01

Triple agreement (between human, GPT, and Gemini) increased by 6.38% on QAGS-C and 7.62% on SummEval after re-evaluation.

02

Model accuracy improved by up to 8.51% (Gemini on QAGS-C) after human adjudication; GPT gained 4.25% on QAGS-C and 2.34% on SummEval.

03

Adjudicators often sided with models over original human labels.

Abstract

Hallucination remains a persistent challenge in Large Language Models (LLMs), particularly in context-grounded settings such as retrieval-augmented generation (RAG) and agentic AI systems. This study focuses on contextual hallucination detection in summarization tasks. We analyze the QAGS-C and SummEval datasets by comparing original benchmark annotations with reason- and span-based predictions from Gemini 2.5 Flash and GPT-5 Mini. To address systematic divergences between human labels and LLM judgments, we re-evaluated all conflicted samples through a human adjudication process involving two cross-cultural adjudicators. Following this re-evaluation, triple agreement (between human, GPT, and Gemini) increased by 6.38% for QAGS-C and 7.62% for SummEval. Model accuracy also improved: GPT gained 4.25% on QAGS-C and 2.34% on SummEval, while Gemini gained 8.51% and 3.80%, respectively. Notably,…
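
The quantities reported in the abstract (triple agreement and model accuracy, before and after adjudication) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names, the binary label encoding (1 = hallucinated, 0 = consistent), the reading of "conflicted" as any sample on which the three label sources are not unanimous, and the toy labels, which are not data from QAGS-C or SummEval.

```python
from typing import Dict, List, Sequence


def _check_lengths(*label_lists: Sequence[int]) -> int:
    """Return the common length, or raise if lists are empty or unequal."""
    n = len(label_lists[0])
    if n == 0 or any(len(labels) != n for labels in label_lists):
        raise ValueError("label lists must be non-empty and equally long")
    return n


def triple_agreement(human: Sequence[int], gpt: Sequence[int],
                     gemini: Sequence[int]) -> float:
    """Fraction of samples on which all three label sources agree."""
    n = _check_lengths(human, gpt, gemini)
    return sum(h == a == b for h, a, b in zip(human, gpt, gemini)) / n


def accuracy(reference: Sequence[int], predicted: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    n = _check_lengths(reference, predicted)
    return sum(r == p for r, p in zip(reference, predicted)) / n


def conflicted_indices(human: Sequence[int], gpt: Sequence[int],
                       gemini: Sequence[int]) -> List[int]:
    """Indices where the three sources are not unanimous (assumed criterion)."""
    _check_lengths(human, gpt, gemini)
    return [i for i, (h, a, b) in enumerate(zip(human, gpt, gemini))
            if not (h == a == b)]


def apply_adjudication(original: Sequence[int],
                       verdicts: Dict[int, int]) -> List[int]:
    """Copy the original labels, overriding those the adjudicators revisited."""
    updated = list(original)
    for index, label in verdicts.items():
        updated[index] = label
    return updated


# Hypothetical toy labels, not benchmark data.
original = [0, 1, 1, 0, 1, 0, 0, 1]   # original benchmark annotations
gpt      = [0, 1, 0, 0, 1, 0, 1, 1]   # stand-in for one model's predictions
gemini   = [0, 1, 0, 0, 1, 1, 1, 1]   # stand-in for the other model's predictions

# Hypothetical verdicts: adjudicators side with the models on samples 2 and 6
# and keep the original label on sample 5.
verdicts = {2: 0, 5: 0, 6: 1}
adjudicated = apply_adjudication(original, verdicts)

print("conflicted:", conflicted_indices(original, gpt, gemini))
print("triple agreement:", triple_agreement(original, gpt, gemini),
      "->", triple_agreement(adjudicated, gpt, gemini))
print("GPT accuracy:", accuracy(original, gpt), "->", accuracy(adjudicated, gpt))
print("Gemini accuracy:", accuracy(original, gemini),
      "->", accuracy(adjudicated, gemini))
```

On these toy labels, triple agreement moves from 0.625 to 0.875 once the adjudicated labels replace the originals. Only the reference labels change between the two measurements; the model predictions stay fixed, which is why a gain in accuracy here reflects corrected annotations and not a better model.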
