TL;DR
AntiPaSTO is a self-supervised method that steers language models by learning antiparallel representations. It trains on synthetic contrasting word pairs without preference labels and outperforms prompting baselines on DailyDilemmas.
Contribution
AntiPaSTO introduces an antiparallel representation technique for scalable model steering, trained only on synthetic pairs of contrasting words inserted into template sentences.
Findings
Trained on 800 synthetic pairs with Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x in Steering F1 on DailyDilemmas.
It wins on 5 of the 6 tested value axes.
Preliminary evidence suggests it maintains bidirectional control in cases where prompting triggers refusal.
Abstract
As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.
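To make the two ingredients in the abstract concrete, here is a minimal sketch and not the authors' implementation. It builds contrast pairs by inserting two opposing words into template sentences, then shows what "antiparallel" means for a steering coefficient: +1 and -1 move a hidden state by equal and opposite amounts. The templates, word pairs, and the `make_contrast_pairs` and `steer` helpers are hypothetical. The fixed additive direction is a stand-in for the learned transformation, and the coherence constraints are not modeled.

```python
from itertools import product

# Hypothetical templates and word pairs, chosen only for illustration.
TEMPLATES = [
    "You are a very {word} person.",
    "You always give {word} answers.",
]
WORD_PAIRS = [("honest", "dishonest"), ("kind", "cruel")]


def make_contrast_pairs(templates, word_pairs):
    """Return (positive, negative) prompts that differ in exactly one word."""
    return [
        (t.format(word=pos), t.format(word=neg))
        for t, (pos, neg) in product(templates, word_pairs)
    ]


def steer(hidden, direction, coeff):
    """Shift a hidden-state vector along `direction`, scaled by `coeff`.

    Assumption: a fixed additive direction stands in for whatever
    transformation the method actually learns.
    """
    return [h + coeff * d for h, d in zip(hidden, direction)]


pairs = make_contrast_pairs(TEMPLATES, WORD_PAIRS)

# Toy hidden state and steering direction (made-up numbers).
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

plus = steer(hidden, direction, +1)
minus = steer(hidden, direction, -1)

# Antiparallel property: the two shifts are exact negatives of each other.
shift_plus = [p - h for p, h in zip(plus, hidden)]
shift_minus = [m - h for m, h in zip(minus, hidden)]
```

In this toy setting the antiparallel property holds by construction. In the method described above it is something training has to enforce while keeping outputs coherent.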
