When Truthful Representations Flip Under Deceptive Instructions?

Xianxuan Long; Yao Fu; Runchao Li; Mu Sheng; Haotian Yu; Xiaotian Han; Pan Li

arXiv:2507.22149 · cs.AI · October 30, 2025

TL;DR

This paper investigates how deceptive instructions cause internal representational shifts in large language models, revealing layer-wise and feature-level signatures that distinguish truthful from deceptive responses.

Contribution

It introduces a method to detect representational flips in LLMs caused by deceptive instructions using linear probes and autoencoders, providing new insights into model dishonesty.

Findings

01

Deceptive instructions induce significant representational shifts in early-to-mid layers.

02

Internal representations of truthful and deceptive instructions are distinguishable using linear probes.

03

Specific features in autoencoders are highly sensitive to deception, enabling potential detection.
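
The linear-probe finding can be illustrated with a minimal sketch. This is not the authors' code: the activations below are synthetic stand-ins for hidden states, and the helper names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are hypothetical. It only shows what "distinguishable using linear probes" means in practice: a single linear layer, trained on frozen activations, predicts the label well above chance.

```python
import math
import random


def make_synthetic_activations(n, dim, seed=0):
    """Assumption: two Gaussian clusters stand in for real hidden states."""
    rng = random.Random(seed)
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    xs, ys = [], []
    for i in range(n):
        label = i % 2  # 1 = model answers "True", 0 = "False"
        sign = 1.0 if label == 1 else -1.0
        xs.append([rng.gauss(0.0, 1.0) + sign * 0.8 * d for d in direction])
        ys.append(label)
    return xs, ys


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(xs, ys, lr=0.1, epochs=50):
    """Logistic regression trained with plain SGD; the probe is just (w, b)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def probe_accuracy(w, b, xs, ys):
    """Fraction of examples whose thresholded logit matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(xs)


train_x, train_y = make_synthetic_activations(200, 8, seed=0)
test_x, test_y = make_synthetic_activations(100, 8, seed=1)
w, b = train_linear_probe(train_x, train_y)
acc = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {acc:.2f}")
```

In a real analysis, the synthetic rows would be replaced by hidden states extracted at a chosen layer, with one probe trained per layer and per instruction condition so that accuracies can be compared across them.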

Abstract

Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLMs, compared to truthful ones, remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations "flip", such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find that the model's instructed True/False output is predictable via linear probes across all conditions based on the internal representation. Further, we use Sparse Autoencoders (SAEs) to show that Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are…
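
One simple way to look for deception-sensitive SAE features is to compare mean feature activations between instruction conditions. The sketch below is an illustration under stated assumptions, not the paper's metric: the activation rows are made-up toy values, and `mean_feature_activations`, `rank_shifted_features`, and `cosine_similarity` are hypothetical helpers.

```python
import math


def mean_feature_activations(rows):
    """Average each SAE feature over all examples in one condition."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def rank_shifted_features(truthful_rows, deceptive_rows, top_k=2):
    """Return indices of the features whose mean activation changes most."""
    mt = mean_feature_activations(truthful_rows)
    md = mean_feature_activations(deceptive_rows)
    shifts = [(abs(d - t), j) for j, (t, d) in enumerate(zip(mt, md))]
    shifts.sort(reverse=True)
    return [j for _, j in shifts[:top_k]]


def cosine_similarity(u, v):
    """Cosine of the angle between two activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy SAE activations (assumption): 3 examples x 4 features per condition.
truthful = [[0.0, 1.0, 0.5, 0.0], [0.0, 0.8, 0.5, 0.0], [0.0, 1.2, 0.5, 0.0]]
deceptive = [[2.0, 0.1, 0.5, 0.0], [1.8, 0.0, 0.5, 0.0], [2.2, 0.2, 0.5, 0.0]]

top = rank_shifted_features(truthful, deceptive, top_k=2)
cos = cosine_similarity(
    mean_feature_activations(truthful), mean_feature_activations(deceptive)
)
print("most shifted features:", top)
print(f"cosine similarity of condition means: {cos:.3f}")
```

A low cosine similarity between condition means signals a large overall shift, while the ranked indices point to the individual features that drive it. Repeating the comparison per layer would show where in the network the shift concentrates.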
