SelfCheck-Eval: A Multi-Module Framework for Zero-Resource Hallucination Detection in Large Language Models

Diyana Muhammed; Giusy Giulia Tuccari; Gollam Rabby; Sören Auer; Sahar Vahdati

arXiv:2502.01812 · cs.CL · December 30, 2025

Open Access

TL;DR

This paper introduces SelfCheck-Eval, a multi-module framework for detecting hallucinations in large language models, especially in mathematical reasoning, addressing a critical gap in current benchmarks and detection methods.

Contribution

The paper presents a novel, domain-agnostic hallucination detection framework with a new benchmark dataset for mathematical reasoning hallucinations in LLMs.

Findings

01. Detection methods perform poorly on mathematical reasoning content.
02. Existing benchmarks are limited to general knowledge domains.
03. Systematic performance disparities exist across different content domains.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, from open-domain question answering to scientific writing, medical decision support, and legal analysis. However, their tendency to generate incorrect or fabricated content, commonly known as hallucinations, represents a critical barrier to reliable deployment in high-stakes domains. Current hallucination detection benchmarks are limited in scope, focusing primarily on general-knowledge domains while neglecting specialised fields where accuracy is paramount. To address this gap, we introduce the AIME Math Hallucination dataset, the first comprehensive benchmark specifically designed for evaluating mathematical reasoning hallucinations. Additionally, we propose SelfCheck-Eval, an LLM-agnostic, black-box hallucination detection framework applicable to both open and closed-source LLMs. Our…

Taxonomy

Topics: Machine Learning in Healthcare · Big Data and Digital Economy

Methods: LLaMA