The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Arpit Singh Gautam, Kailash Talreja, Saurabh Jha

TL;DR
This paper introduces DiffuTruth, an unsupervised method that uses diffusion model likelihoods and thermodynamic principles to detect hallucinations in large language models. It scores a claim by its semantic energy, that is, how far the claim drifts in meaning after being corrupted with noise and reconstructed, and treats stable claims as more likely to be factual.
Contribution
It presents a novel thermodynamics-inspired framework and metrics for fact verification, outperforming existing methods in unsupervised hallucination detection and zero-shot generalization.
Findings
Achieves state-of-the-art AUROC of 0.725 on FEVER
Outperforms baselines by 1.5% in AUROC
Outperforms baselines by over 4% on HOVER dataset
Abstract
Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric that uses an NLI critic to measure the semantic divergence between the original claim and its reconstruction. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration that fuses this stability signal with discriminative confidence. Extensive experiments on FEVER…
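The corrupt, reconstruct, and compare loop described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation. The masking noise, the use of a contradiction probability as the divergence, the convex-combination fusion in hybrid_score, and every function name and parameter here are assumptions. The toy denoiser and toy critic stand in for a real discrete text diffusion model and a real NLI model.

```python
import random
from typing import Callable, List

MASK = "[MASK]"


def corrupt(tokens: List[str], noise_level: float, rng: random.Random) -> List[str]:
    """Replace each token with MASK with probability noise_level (assumed noise process)."""
    return [MASK if rng.random() < noise_level else t for t in tokens]


def semantic_energy(
    claim: str,
    reconstruct: Callable[[List[str]], str],
    contradiction_prob: Callable[[str, str], float],
    noise_level: float = 0.5,
    n_samples: int = 32,
    seed: int = 0,
) -> float:
    """Average critic score between a claim and its noisy reconstructions.

    Higher values mean the claim is less stable under corruption and reconstruction.
    """
    rng = random.Random(seed)
    tokens = claim.split()
    total = 0.0
    for _ in range(n_samples):
        noisy = corrupt(tokens, noise_level, rng)
        total += contradiction_prob(claim, reconstruct(noisy))
    return total / n_samples


def hybrid_score(energy: float, confidence: float, alpha: float = 0.5) -> float:
    """Hypothetical fusion: weighted mix of instability and (1 - classifier confidence)."""
    return alpha * energy + (1.0 - alpha) * (1.0 - confidence)


# --- Toy stand-ins (assumptions, for demonstration only) ---
FACT = "paris is the capital of france".split()


def toy_reconstruct(noisy: List[str]) -> str:
    """Toy denoiser: fills each masked position from one memorized sentence."""
    return " ".join(
        FACT[i] if t == MASK and i < len(FACT) else t for i, t in enumerate(noisy)
    )


def toy_contradiction_prob(premise: str, hypothesis: str) -> float:
    """Toy critic: any textual change counts as a full contradiction."""
    return 0.0 if premise == hypothesis else 1.0


true_energy = semantic_energy(
    "paris is the capital of france", toy_reconstruct, toy_contradiction_prob
)
false_energy = semantic_energy(
    "lyon is the capital of france", toy_reconstruct, toy_contradiction_prob
)
print(true_energy, false_energy)
```

In this toy setup the true claim is always reconstructed unchanged, so its energy is zero, while the false claim is pulled back toward the memorized sentence whenever its first token is masked. With real models, reconstruct would wrap a diffusion sampler and contradiction_prob would wrap an NLI classifier.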
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Misinformation and Its Impacts · Generative Adversarial Networks and Image Synthesis
