HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection

Deanna Emery; Michael Goitia; Freddie Vargus; Iulia Neagu

arXiv:2505.00506·cs.CL·May 2, 2025

TL;DR

HalluMix introduces a comprehensive, task-agnostic benchmark for detecting hallucinations in LLM outputs across diverse real-world scenarios, revealing performance gaps and guiding future improvements.

Contribution

The paper presents the HalluMix Benchmark, a diverse dataset for hallucination detection, and uses it to evaluate seven detection systems, comparing their performance across tasks and context lengths.

Findings

01

Quotient Detections achieves 0.82 accuracy and 0.84 F1 score.

02

Performance varies significantly between short and long contexts.

03

Existing systems show notable performance disparities across tasks.
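
The accuracy and F1 figures in the findings above are standard binary-classification metrics. As a minimal sketch of how such scores are typically computed, the snippet below assumes that label 1 means "hallucinated" and is treated as the positive class; that convention, the function name, and the toy labels are illustrative assumptions and are not taken from the paper or the dataset.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels.

    Assumption: 1 = hallucinated (positive class), 0 = faithful.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one example")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hallucination caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # faithful text flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # hallucination missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # faithful text passed

    accuracy = (tp + tn) / len(pairs)
    # F1 is the harmonic mean of precision and recall: 2*TP / (2*TP + FP + FN).
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy labels invented for illustration (not drawn from HalluMix).
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Reporting both numbers is useful because accuracy alone can look strong on an imbalanced label distribution, while F1 only rewards correct handling of the positive class.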

Abstract

As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content – text that is not grounded in supporting evidence – has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems – both open and closed source – highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance…

Code & Models

Datasets

quotientai/HalluMix
dataset · 43 downloads