Hallucination Detection via Activations of Open-Weight Proxy Analyzers
Akshita Singh, Prabesh Paudel, Siddhartha Roy

TL;DR
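This paper presents a proxy-analyzer framework that detects hallucinations in large language model output by passing the generated text through a small open-weight model and analyzing that model's internal activations. Because the generator's internals are never needed, the approach also works for closed APIs such as GPT-4.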
Contribution
It introduces eighteen features grounded in how transformers process text and demonstrates a stacking ensemble that outperforms previous hallucination detection approaches across diverse models.
Findings
The framework consistently outperforms ReDeEP's token-level AUC of 0.73 on RAGTruth.
With Qwen2.5-7B as the analyzer, it achieves an F1 score of 0.717, surpassing ReDeEP.
Model size has limited impact on detection performance, with tight clustering across models.
Abstract
We introduce a proxy-analyzer framework for detecting hallucinations in large language models. Instead of looking inside the generating model, our system reads already-generated text through a small locally hosted open-weight model and spots hallucinations using the reader's own internal activations. This works just as well when the generator is a closed API like GPT-4 as when it is any open-weight model. We built eighteen features grounded in how transformers process text, covering residual stream norms, per-head source-document attention, entropy, MLP activations, logit-lens trajectories, and three new token-level grounding statistics. We trained a stacking ensemble on 72,135 samples from five hallucination datasets. We tested across seven analyzer architectures from 0.5 billion to 9 billion parameters: Qwen2.5 at 0.5B and 7B, Gemma-2 at 2B and 9B, Pythia at 1.4B, and LLaMA-3 at both…
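To make the idea of activation-derived features concrete, here is a minimal sketch of three per-token signals of the kind the abstract lists: entropy, source-document attention, and residual stream norm. The abstract does not give the paper's exact feature definitions, so the function names, the input layout (source document occupying the first `source_len` positions), and the formulas below are illustrative assumptions, not the authors' implementation. In practice the inputs would come from a forward pass of the analyzer model over the generated text; here they are plain Python lists.

```python
import math
from typing import Dict, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def source_attention_mass(attn_row: Sequence[float], source_len: int) -> float:
    """Fraction of one head's attention that falls on source-document positions.

    Assumption: the source document occupies the first `source_len` positions
    of the analyzer's input, followed by the generated answer.
    """
    total = sum(attn_row)
    if total == 0.0:
        return 0.0
    return sum(attn_row[:source_len]) / total


def residual_norm(hidden: Sequence[float]) -> float:
    """L2 norm of a residual-stream vector at one token position."""
    return math.sqrt(sum(x * x for x in hidden))


def token_features(
    probs: Sequence[float],
    attn_row: Sequence[float],
    hidden: Sequence[float],
    source_len: int,
) -> Dict[str, float]:
    """Bundle the three hypothetical signals into one feature record."""
    return {
        "entropy": token_entropy(probs),
        "source_attention": source_attention_mass(attn_row, source_len),
        "residual_norm": residual_norm(hidden),
    }


# Toy values standing in for real analyzer activations at one token.
example = token_features(
    probs=[0.25, 0.25, 0.25, 0.25],
    attn_row=[0.1, 0.2, 0.3, 0.4],
    hidden=[3.0, 4.0],
    source_len=2,
)
print(example)
```

Records like `example`, one per token or aggregated per response, would then serve as inputs to a classifier such as the stacking ensemble described above. Low source attention combined with high entropy is one intuitive pattern such a classifier might learn to associate with ungrounded text, though which features matter most is an empirical question the excerpt does not settle.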
