SelfCheck-Eval: A Multi-Module Framework for Zero-Resource Hallucination Detection in Large Language Models
Diyana Muhammed, Giusy Giulia Tuccari, Gollam Rabby, Sören Auer, Sahar Vahdati

TL;DR
This paper introduces SelfCheck-Eval, a multi-module framework for detecting hallucinations in large language models, especially in mathematical reasoning, addressing a critical gap in current benchmarks and detection methods.
Contribution
The paper presents a novel, domain-agnostic hallucination detection framework with a new benchmark dataset for mathematical reasoning hallucinations in LLMs.
Findings
Detection methods perform poorly on mathematical reasoning content.
Existing benchmarks are limited to general knowledge domains.
Systematic performance disparities exist across different content domains.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, represents a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating mathematical reasoning hallucinations. Additionally, we propose SelfCheck-Eval, an LLM-agnostic, black-box hallucination detection framework applicable to both open and closed-source LLMs. Our…
Taxonomy
Topics: Machine Learning in Healthcare · Big Data and Digital Economy
Methods: LLaMA
