Subliminal Steering: Stronger Encoding of Hidden Signals
George Morgulis, John Hewitt

TL;DR
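This paper introduces subliminal steering, a variant of subliminal learning in which the teacher's bias comes from a trained steering vector rather than a system prompt. It shows that complex multi-word biases transfer to the student, that the steering vector itself is transferred, and that the bias is encoded with high precision.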
Contribution
It extends subliminal learning from single-word preferences to multi-word biases, offers mechanistic insight into how the transfer happens, and shows that biases are encoded in the student with high precision.
Findings
Subliminal steering transfers complex multi-word biases.
The steering vector itself is transferred and localized in specific model layers.
The biases the student encodes have high cosine similarity to the original steering vectors.
Abstract
Subliminal learning describes a student language model inheriting a behavioral bias by fine-tuning on seemingly innocuous data generated by a biased teacher model. Prior work has begun to characterize this phenomenon but leaves open questions about the scope of signals it can transfer, the mechanisms that explain it, and the precision with which a bias can be encoded by seemingly unrelated data. We tackle all three problems by introducing subliminal steering, a variant of subliminal learning in which the teacher's bias is implemented not via a system prompt, as in prior work, but through a steering vector trained to maximize the likelihood of a set of target samples. First, we show that subliminal steering transfers complex multi-word biases, whereas prior work focused on single-word preferences, demonstrating a large scope of subliminally transferable signals. Second, we provide…
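To make the idea of "a steering vector trained to maximize the likelihood of a set of target samples" concrete, here is a minimal toy sketch in NumPy. It is not the authors' implementation: the frozen "model" is just a random unembedding matrix applied to random hidden states, and every name (`W_out`, `hidden`, `targets`, `train_steering_vector`) is a hypothetical stand-in. In a real setup the vector would presumably be added to the activations of an actual language model at some layer; here it is simply added to each hidden state before the output projection, and only the vector is updated.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, n_contexts = 16, 10, 32

# Frozen toy "model" (assumption): fixed hidden states and a fixed unembedding matrix.
W_out = rng.normal(size=(d_model, vocab))
hidden = rng.normal(size=(n_contexts, d_model))
# Hypothetical target samples: the biased behavior is "emit token 3" in every context.
targets = np.full(n_contexts, 3)
rows = np.arange(n_contexts)

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def mean_nll(v):
    """Mean negative log-likelihood of the targets when v is added to every hidden state."""
    logp = log_softmax((hidden + v) @ W_out)
    return -logp[rows, targets].mean()

def nll_grad(v):
    """Gradient of mean_nll with respect to v (model weights stay frozen)."""
    p = np.exp(log_softmax((hidden + v) @ W_out))
    p[rows, targets] -= 1.0              # d(loss)/d(logits) = softmax - one_hot
    return (p @ W_out.T).mean(axis=0)    # chain rule through the output projection

def train_steering_vector(steps=1000, lr=0.01):
    """Plain gradient descent on the steering vector only."""
    v = np.zeros(d_model)
    for _ in range(steps):
        v -= lr * nll_grad(v)
    return v

v = train_steering_vector()
loss_before = mean_nll(np.zeros(d_model))
loss_after = mean_nll(v)
print(f"NLL without steering: {loss_before:.3f}, with steering: {loss_after:.3f}")
```

The design point the sketch illustrates is that the bias lives entirely in `v`: the model's weights never change, so the same vector can be switched on when the teacher generates data and later compared against whatever the student learns.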
