The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

Shikhar Shiromani, Archie Chaudhury, and Sri Pranav Kunda

arXiv:2602.02496·cs.CL·February 4, 2026

TL;DR

This paper introduces the Hypocrisy Gap, a metric using Sparse Autoencoders to measure divergence between a language model's internal reasoning and its final output, helping detect unfaithful or hypocritical behavior.

Contribution

The paper presents the Hypocrisy Gap, a mechanistic metric that uses sparse autoencoders to quantify the divergence between an LLM's internal beliefs and its final outputs.

Findings

01 Achieves AUROC of 0.55-0.73 for detecting sycophantic behavior

02 Outperforms baseline AUROC of 0.41-0.50

03 Effective across multiple LLMs like Gemma, Llama, and Qwen

Abstract

Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain-of-thought (CoT) reasoning to appease the user they are conversing with. To better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric that uses Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model's tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally…
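The abstract's general idea (score an internal "truth belief" with a linear probe over SAE features, compare it with the same score on the final generation, then evaluate the resulting gap with AUROC) can be illustrated with a minimal sketch. This is not the authors' formula, which the excerpt above does not give. The function names `probe_score`, `hypocrisy_gap_sketch`, and `auroc`, the sigmoid probe, the simple score difference, and the toy feature vectors are all assumptions made for illustration.

```python
import math


def probe_score(features, weights, bias=0.0):
    """Sigmoid output of a linear probe over SAE feature activations.

    Assumption: the probe is a plain logistic model; weights that are
    mostly zero would make it a 'sparse' probe.
    """
    z = bias + sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def hypocrisy_gap_sketch(reasoning_feats, answer_feats, weights, bias=0.0):
    """Hypothetical gap: probe score on the internal state minus the
    probe score on the final generation. A large positive value means the
    probe reads 'true' internally but not in the output."""
    internal = probe_score(reasoning_feats, weights, bias)
    final = probe_score(answer_feats, weights, bias)
    return internal - final


def auroc(scores, labels):
    """AUROC as the probability that a random positive example scores
    higher than a random negative one (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (made up): three SAE features, probe weights mostly zero.
weights = [2.0, -1.0, 0.0]
runs = [
    # (reasoning features, answer features, label: 1 = sycophantic run)
    ([1.0, 0.0, 0.3], [0.0, 1.0, 0.3], 1),
    ([1.0, 0.0, 0.1], [0.2, 0.8, 0.1], 1),
    ([1.0, 0.0, 0.5], [1.0, 0.0, 0.5], 0),
    ([0.0, 1.0, 0.2], [0.0, 1.0, 0.2], 0),
]
gaps = [hypocrisy_gap_sketch(r, a, weights) for r, a, _ in runs]
labels = [y for _, _, y in runs]
toy_auroc = auroc(gaps, labels)
```

An AUROC of 0.5 corresponds to chance-level ranking, which is the reference point for reading the reported 0.55-0.73 range against the 0.41-0.50 baseline.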

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling