TL;DR
This paper investigates how deceptive instructions cause internal representational shifts in large language models, revealing layer-wise and feature-level signatures that distinguish truthful from deceptive responses.
Contribution
It introduces a method for detecting representational flips caused by deceptive instructions in LLMs, using linear probes and sparse autoencoders (SAEs), and offers new insight into model dishonesty.
Findings
Deceptive instructions induce significant representational shifts in early-to-mid layers.
Internal representations of truthful and deceptive instructions are distinguishable using linear probes.
Specific sparse autoencoder (SAE) features are highly sensitive to deceptive instructions, which makes them candidates for deception detection.
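To make the linear-probe finding concrete, here is a minimal sketch of a probe trained on synthetic activations. It does not reproduce the paper's setup: the data generator, the dimensions, and the function names (`make_synthetic_activations`, `train_linear_probe`, `probe_accuracy`) are all illustrative assumptions, and random vectors stand in for real hidden states from Llama-3.1-8B-Instruct or Gemma-2-9B-Instruct.

```python
import numpy as np


def make_synthetic_activations(n=400, d=32, shift=2.0, seed=0):
    """Hypothetical stand-in for hidden states: Gaussian noise plus a
    label-dependent offset along one random unit direction."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n)           # 0 = False, 1 = True
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    signs = 2 * labels[:, None] - 1               # map {0, 1} to {-1, +1}
    acts = rng.normal(size=(n, d)) + shift * signs * direction
    return acts, labels


def train_linear_probe(X, y, lr=0.1, steps=500):
    """Logistic-regression probe fitted with plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(label = 1)
        grad = p - y                              # d(loss)/d(logit)
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of examples whose thresholded logit matches the label."""
    preds = (X @ w + b > 0).astype(int)
    return float((preds == y).mean())


# Train on the first 300 examples, evaluate on the held-out 100.
X, y = make_synthetic_activations()
w, b = train_linear_probe(X[:300], y[:300])
acc = probe_accuracy(X[300:], y[300:], w, b)
print(f"held-out probe accuracy: {acc:.2f}")
```

In a real analysis, `X` would presumably hold activations extracted at a fixed layer and token position, with one probe fitted per layer and per instruction condition, so that accuracy can be compared across layers.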
Abstract
Large language models (LLMs) tend to follow maliciously crafted instructions to generate deceptive responses, posing safety challenges. How deceptive instructions alter the internal representations of LLMs compared to truthful ones remains poorly understood beyond output analysis. To bridge this gap, we investigate when and how these representations "flip", such as from truthful to deceptive, under deceptive versus truthful/neutral instructions. Analyzing the internal representations of Llama-3.1-8B-Instruct and Gemma-2-9B-Instruct on a factual verification task, we find that the model's instructed True/False output is predictable from its internal representations via linear probes across all conditions. Further, we use Sparse Autoencoders (SAEs) to show that Deceptive instructions induce significant representational shifts compared to Truthful/Neutral representations (which are…