Semantic Energy: Detecting LLM Hallucination Beyond Entropy

Huan Ma; Jiadong Pan; Jing Liu; Yan Chen; Joey Tianyi Zhou; Guangyu Wang; Qinghua Hu; Hua Wu; Changqing Zhang; Haifeng Wang

arXiv:2508.14496·cs.LG·December 2, 2025

TL;DR

Semantic Energy is a method for detecting hallucinations in LLMs that estimates model confidence directly from logits rather than from post-softmax probabilities. The authors report that it outperforms semantic entropy on their benchmarks.

Contribution

It introduces Semantic Energy, a novel uncertainty estimation framework that leverages logits and semantic clustering to better detect hallucinations in LLMs.

Findings

01 Semantic Energy improves hallucination detection accuracy.

02 It outperforms semantic entropy in uncertainty estimation.

03 The method enhances reliability for downstream applications.

Abstract

Large Language Models (LLMs) are being increasingly deployed in real-world applications, but they remain susceptible to hallucinations: fluent yet incorrect responses that can lead to erroneous decision-making. Uncertainty estimation is a feasible approach to detecting such hallucinations. For example, semantic entropy estimates uncertainty from the semantic diversity across multiple sampled responses and uses it to identify hallucinations. However, semantic entropy relies on post-softmax probabilities and fails to capture the model's inherent uncertainty, which makes it ineffective in certain scenarios. To address this issue, we introduce Semantic Energy, a novel uncertainty estimation framework that leverages the inherent confidence of LLMs by operating directly on the logits of the penultimate layer. By combining semantic clustering with a Boltzmann-inspired energy distribution,…
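The contrast drawn in the abstract can be illustrated with a small sketch. It is not the paper's formulation (the abstract above is truncated before the method is defined); it assumes that each sampled response already has a semantic cluster id, a sequence log-probability, and the raw logits of its chosen tokens. The function names and the choice of "negative mean logit" as a response energy are hypothetical and are used only for illustration.

```python
import math
from collections import Counter, defaultdict


def semantic_entropy(cluster_ids, seq_logprobs):
    """Entropy over semantic clusters, built from post-softmax sequence probabilities."""
    mass = defaultdict(float)
    for cid, lp in zip(cluster_ids, seq_logprobs):
        mass[cid] += math.exp(lp)
    total = sum(mass.values())
    return -sum((m / total) * math.log(m / total) for m in mass.values())


def dominant_cluster_energy(cluster_ids, token_logits):
    """Illustrative energy score (an assumption, not the paper's definition).

    Each response gets energy = -(mean of its chosen-token logits), so larger
    logits give lower energy, in the Boltzmann sense p ~ exp(-E). The score is
    the mean response energy inside the most frequent semantic cluster.
    """
    dominant, _ = Counter(cluster_ids).most_common(1)[0]
    energies = [
        -sum(logits) / len(logits)
        for cid, logits in zip(cluster_ids, token_logits)
        if cid == dominant
    ]
    return sum(energies) / len(energies)


# Two hypothetical questions whose three samples all fall into one cluster.
ids = [0, 0, 0]
logprobs = [-0.1, -0.1, -0.1]                              # same normalized probabilities
high_logits = [[30.0, 28.0], [31.0, 29.0], [30.0, 30.0]]   # large raw logits
low_logits = [[5.0, 4.0], [6.0, 5.0], [5.0, 5.0]]          # small raw logits

print(semantic_entropy(ids, logprobs))            # zero for both questions
print(dominant_cluster_energy(ids, high_logits))  # lower energy
print(dominant_cluster_energy(ids, low_logits))   # higher energy
```

In this toy case the cluster entropy is zero for both questions because every sample lands in the same cluster, while the logit-based score still separates them. That is the kind of scenario the abstract describes as a failure mode of post-softmax approaches.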

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

S1. I find the core idea (as described in the summary) interesting (but incremental) and I think it's worthwhile effort to evaluate it.

Weaknesses

W1. Incremental contribution. The core idea is incremental, given the existence of Semantic Entropy (SE) (Kuhn et al.) and LogTokU (Ma et al., 2025), which uses logits as a UQ score in NLG settings (albeit the evaluation in that paper is also rather lacking).

W2. Lacking appropriate empirical comparison to prior work.

W2a. More methods. The key results in Tables 1 and 2 should incorporate at the very least LogTokU; this key ablation cannot be conducted just in the form of Figure 2. I'd par

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The motivation is clear and well connected to a real shortcoming of Semantic Entropy.
2. The derivation of the energy-based formulation is conceptually elegant and mathematically consistent.
3. The paper is very well written and easy to follow.
4. Experiments show consistent improvements over a strong baseline (Semantic Entropy), across two models and two datasets.
5. The ablations are thorough and empirically convincing.
6. The Fermi-Dirac extension, though exploratory, demonstrates the auth

Weaknesses

1. The empirical scope is relatively narrow: only two models and two QA datasets are tested. While both are multilingual, the generalization to other domains (e.g., reasoning, dialogue, factuality) remains unclear.
2. Comparisons are limited to Semantic Entropy; other strong baselines such as Logit-based OOD detectors, Semantic Uncertainty (Kuhn et al., 2024), Sample Consistency (Lyu et al., 2025), IDK-token (Cohen et al., 2024), or Self-Reflective Uncertainties (Kirchhof et al., 2025) are missin

Reviewer 03 · Rating 2 · Confidence 4

Strengths

* UQ for LLMs is an area of growing interest
* The authors show improved performance on two datasets, using two models

Weaknesses

* Lack of experiments: The paper only evaluates on two models, using two LLMs, and only compares against (one variant of) semantic entropy. I understand that not all researchers have the same access to resources... but this isn’t nearly enough to evaluate whether semantic energy outperforms semantic entropy. Further, I disagree that it is “sufficient” to compare just with semantic entropy, particularly with so few experiments — it is hard to know whether these are cases where semantic entropy do

Taxonomy

Topics: Benford’s Law and Fraud Detection · Complex Systems and Time Series Analysis · Plant-based Medicinal Research