Beyond In-Domain Detection: SpikeScore for Cross-Domain Hallucination Detection

Yongxin Deng; Zhen Fang; Sharon Li; Ling Chen

arXiv:2601.19245·cs.AI·February 20, 2026

Open Access 3 Reviews

TL;DR

This paper introduces SpikeScore, a novel method for cross-domain hallucination detection in large language models, leveraging uncertainty fluctuations in multi-turn dialogues to improve robustness across diverse domains.

Contribution

The paper proposes SpikeScore, a new score based on uncertainty fluctuations, and demonstrates its effectiveness for generalizable hallucination detection across multiple domains and models.

Findings

01

SpikeScore achieves strong cross-domain separability.

02

Outperforms baseline methods in cross-domain detection.

03

Validated across multiple LLMs and benchmarks.

Abstract

Hallucination detection is critical for deploying large language models (LLMs) in real-world applications. Existing hallucination detection methods achieve strong performance when the training and test data come from the same domain, but they suffer from poor cross-domain generalization. In this paper, we study an important yet overlooked problem, termed generalizable hallucination detection (GHD), which aims to train hallucination detectors on data from a single domain while ensuring robust performance across diverse related domains. In studying GHD, we simulate multi-turn dialogues following LLMs' initial response and observe an interesting phenomenon: hallucination-initiated multi-turn dialogues universally exhibit larger uncertainty fluctuations than factual ones across different domains. Based on the phenomenon, we propose a new score SpikeScore, which quantifies abrupt…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 4

Strengths

1. This paper observes that in multi-turn conversations triggered by hallucinations, LLMs frequently engage in self-correction. By employing the maximum second-order difference to measure local fluctuations, this paper provides an effective and novel perspective for cross-domain hallucination detection.
2. Extensive cross-domain experiments have demonstrated the superiority of the proposed SpikeScore. Theoretical analysis additionally provides validation for its effectiveness in distinguishing b
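
Reviewer 01's first point describes the score as a maximum second-order difference over the uncertainty observed across dialogue turns. As a rough illustration only, and not the authors' implementation, the sketch below computes that quantity for a list of per-turn uncertainty values. The function name `max_second_order_difference`, the use of an absolute value, and the example numbers are assumptions made for this sketch; the paper's exact definition of SpikeScore may differ.

```python
from typing import Sequence


def max_second_order_difference(uncertainty: Sequence[float]) -> float:
    """Largest absolute second-order difference of a per-turn uncertainty sequence.

    Assumption: the second-order difference at turn t is
    u[t+1] - 2*u[t] + u[t-1], and its magnitude is what matters.
    """
    if len(uncertainty) < 3:
        raise ValueError("need at least three turns to form a second-order difference")
    return max(
        abs(uncertainty[t + 1] - 2 * uncertainty[t] + uncertainty[t - 1])
        for t in range(1, len(uncertainty) - 1)
    )


# Hypothetical per-turn uncertainty values, invented for illustration only.
steady = [0.30, 0.32, 0.31, 0.33, 0.32]  # small turn-to-turn changes
spiky = [0.30, 0.32, 0.90, 0.35, 0.33]   # one abrupt jump and recovery

print(max_second_order_difference(steady))
print(max_second_order_difference(spiky))
```

A second-order difference responds to a sudden rise followed by a fall, while a steady drift contributes little, which matches the reviewer's description of the score as a measure of local fluctuation.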

Weaknesses

The claim of cross-domain generalization is not sufficiently supported. Although there are variations in tasks, such as question answering (CommonsenseQA, TriviaQA), reading comprehension (Belebele, CoQA), and mathematical reasoning (Math, SVAMP), the linguistic styles across these datasets may exhibit similarities. More cross-domain scenarios are expected for evaluation.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The main idea is based on the observation that, when it hallucinates, an LLM produces inconsistent answers within a dialogue. The proposed idea is interesting and new. It extends consistency-based methods by generating multiple answers in a dialogue sequence. This extension is sound.
2. The paper provides strong motivation from concrete observations.
3. The paper contains theoretical analyses of the method.
4. The experiments clearly show the advantage of SpikeScore.

Weaknesses

1. The experimental results may be presented more clearly. Tables 1 and 2 that describe the main results are quite confusing. The two parts of each table are not explained. It is also unclear how leave-one-out is done for training-based methods. It is said that "training-based methods train on each dataset (columns) while all methods are evaluated on the remaining five datasets". So what is the Mean AURA in SEP under TriviaQA/Llama 3.2-3B? Is this the mean AURA tested on 5 other datasets using t
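
The sentence quoted by Reviewer 02 can be read in more than one way, which is the source of the confusion. One possible reading, sketched below purely as an assumption, is that a detector is fit on a single source dataset and the reported number is the unweighted mean of the per-dataset scores on the remaining datasets. The sketch also assumes that "AURA" in the review refers to AUROC. The names `auroc`, `mean_auroc_on_other_domains`, and the `fit` callback are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Each domain maps to (inputs, labels); label 1 marks a hallucinated response.
Domain = Tuple[List[float], List[int]]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auroc_on_other_domains(
    train_domain: str,
    data: Dict[str, Domain],
    fit: Callable[[Domain], Callable[[float], float]],
) -> float:
    """Fit on one domain, then average AUROC over every other domain (assumed protocol)."""
    detector = fit(data[train_domain])
    per_domain = [
        auroc([detector(x) for x in xs], ys)
        for name, (xs, ys) in data.items()
        if name != train_domain
    ]
    return sum(per_domain) / len(per_domain)
```

Under this reading, each table column would hold one such mean per training dataset; whether the paper's tables follow it is exactly what the reviewer asks the authors to clarify.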

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Tackles a realistic and impactful problem setting (generalizable hallucination detection).
2. Novel insight that hallucination triggers instability in multi-turn self-dialogue.
3. Simple but powerful metric (SpikeScore) with no need for additional finetuning.
4. Strong cross-domain results vs powerful baselines (PRISM, ICR probe).

Weaknesses

1. Theory relies on assumptions that may not always hold in real LLM behavior.
2. No evaluation on more complex generative formats (e.g., long-form reasoning beyond QA).
3. Domain choice mainly QA/knowledge tasks; extension to code or vision-language tasks not discussed.

Taxonomy

Topics: Topic Modeling · Advanced Graph Neural Networks · Ferroelectric and Negative Capacitance Devices