HalluMix: A Task-Agnostic, Multi-Domain Benchmark for Real-World Hallucination Detection
Deanna Emery, Michael Goitia, Freddie Vargus, Iulia Neagu

TL;DR
HalluMix introduces a comprehensive, task-agnostic benchmark for detecting hallucinations in LLM outputs across diverse real-world scenarios, revealing performance gaps and guiding future improvements.
Contribution
The paper presents the HalluMix Benchmark, a novel, diverse dataset for hallucination detection, and uses it to evaluate multiple detection systems across tasks and contexts.
Findings
Quotient Detections achieves 0.82 accuracy and 0.84 F1 score.
Performance varies significantly between short and long contexts.
Existing systems show notable performance disparities across tasks.
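For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how accuracy and F1 are conventionally computed for a binary hallucination detector. It is not the paper's evaluation code. The function name, the label convention (1 = hallucinated, 0 = faithful), and the toy data are assumptions made for illustration only.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels.

    Assumed convention (not taken from the paper): 1 = hallucinated, 0 = faithful.
    F1 is computed for the positive (hallucinated) class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must be non-empty")

    # Confusion-matrix counts
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)

    # Guard against division by zero when there are no positive predictions or labels
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1


# Toy labels, purely illustrative (not benchmark data)
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(labels, predictions)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Reporting both numbers is useful because accuracy alone can look strong on an imbalanced set, whereas F1 reflects how well the hallucinated class is caught.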
Abstract
As large language models (LLMs) are increasingly deployed in high-stakes domains, detecting hallucinated content (text that is not grounded in supporting evidence) has become a critical challenge. Existing benchmarks for hallucination detection are often synthetically generated, narrowly focused on extractive question answering, and fail to capture the complexity of real-world scenarios involving multi-document contexts and full-sentence outputs. We introduce the HalluMix Benchmark, a diverse, task-agnostic dataset that includes examples from a range of domains and formats. Using this benchmark, we evaluate seven hallucination detection systems, both open and closed source, highlighting differences in performance across tasks, document lengths, and input representations. Our analysis highlights substantial performance…
