Hallucination Detection via Activations of Open-Weight Proxy Analyzers

Akshita Singh; Prabesh Paudel; Siddhartha Roy

arXiv:2605.07209·cs.CL·May 11, 2026

TL;DR

This paper presents a proxy-analyzer framework that detects hallucinations in large language model output by analyzing the internal activations of a small open-weight model that reads the generated text, so it works even when the generator is a closed API such as GPT-4.

Contribution

The paper introduces eighteen activation-based features grounded in how transformers process text and shows that a stacking ensemble trained on them outperforms previous hallucination detection approaches across diverse models.

Findings

01. The framework consistently outperforms ReDeEP's token-level AUC of 0.73 on RAGTruth.

02. With Qwen2.5-7B, it achieves an F1 score of 0.717, surpassing ReDeEP.

03. Model size has limited impact on detection performance; results cluster tightly across models.

Abstract

We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both…
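To make the feature families named in the abstract more concrete, here is a minimal sketch of three of them: residual stream norms, per-head attention mass on the source document, and predictive entropy. It is not the authors' implementation. All function names are hypothetical, the activations are random NumPy arrays standing in for what a real analyzer model would produce, and mean pooling over response tokens is an assumption made only for illustration.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (nats) of the softmax distribution per token. logits: (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def source_attention_mass(attn, source_len):
    """Share of each head's attention that response tokens place on source tokens.

    attn: (H, T, T) row-stochastic attention weights; returns (H, T - source_len).
    """
    return attn[:, source_len:, :source_len].sum(axis=-1)

def residual_norms(hidden):
    """L2 norm of the residual stream per layer and token. hidden: (L, T, D) -> (L, T)."""
    return np.linalg.norm(hidden, axis=-1)

def extract_features(hidden, attn, logits, source_len):
    """Pool the three statistics over response tokens into one feature vector.

    Mean pooling is an illustrative assumption, not a detail from the paper.
    """
    norms = residual_norms(hidden)[:, source_len:]   # (L, T_resp)
    mass = source_attention_mass(attn, source_len)   # (H, T_resp)
    ent = token_entropy(logits)[source_len:]         # (T_resp,)
    return np.concatenate([norms.mean(axis=1), mass.mean(axis=1), [ent.mean()]])

def causal_softmax(scores):
    """Helper for the demo: turn raw scores (H, T, T) into causal attention weights."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # positions after the query
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Synthetic stand-ins for analyzer activations (hypothetical sizes).
rng = np.random.default_rng(0)
n_layers, n_heads, n_tokens, d_model, vocab = 4, 2, 10, 8, 50
source_len = 6  # first 6 tokens are the source document, the rest the response

hidden = rng.normal(size=(n_layers, n_tokens, d_model))
attn = causal_softmax(rng.normal(size=(n_heads, n_tokens, n_tokens)))
logits = rng.normal(size=(n_tokens, vocab))

features = extract_features(hidden, attn, logits, source_len)
```

In a real pipeline, a vector like `features` would be computed per sample from the analyzer's actual hidden states, attention weights, and logits, then passed to a downstream classifier such as the stacking ensemble the abstract describes.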
