AntiPaSTO: Self-Supervised Honesty Steering via Anti-Parallel Representations

Michael J. Clark

arXiv:2601.07473·cs.LG·May 13, 2026

TL;DR

AntiPaSTO is a self-supervised method that steers language models by separating internal representations along an antiparallel axis. It trains on synthetic contrasting word pairs with no preference labels and outperforms prompting baselines on DailyDilemmas.

Contribution

It introduces a novel antiparallel representation technique for scalable model steering using only synthetic contrasting word pairs.

Findings

01 AntiPaSTO beats prompting baselines by 6.9x in Steering F1 on DailyDilemmas, using 800 synthetic pairs on Gemma-3-1B.

02 It wins on 5 of 6 tested value axes.

03 Preliminary evidence suggests it maintains bidirectional control where prompting triggers refusal.

Abstract

As models grow more capable, humans cannot reliably verify what they say. Scalable steering requires methods that are internal, self-supervised, and transfer out-of-distribution; existing methods satisfy some but not all three. We introduce AntiPaSTO, which separates representations along an antiparallel axis (+1/-1 produce opposite shifts), with coherence constraints preventing collapse. Training uses only two contrasting words inserted into template sentences, with no preference labels. When we use 800 such synthetic pairs on Gemma-3-1B, AntiPaSTO beats prompting baselines by 6.9x Steering F1 on DailyDilemmas and wins on 5 of 6 tested value axes. We also find preliminary evidence that it maintains bidirectional control where prompting triggers refusal.
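The two mechanics named in the abstract, contrasting words inserted into template sentences and a steering coefficient whose +1/-1 settings produce opposite shifts, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the templates, the word pair "honest"/"dishonest", the function names, and the three-dimensional vectors are hypothetical and are not taken from the paper or its repository. In particular, the sketch uses a fixed additive direction, for which antiparallel shifts hold trivially; in AntiPaSTO the separation is learned under coherence constraints.

```python
import math


def make_contrast_pairs(templates, word_pos, word_neg):
    """Fill each template's {trait} slot with both contrasting words."""
    return [(t.format(trait=word_pos), t.format(trait=word_neg)) for t in templates]


def steer(hidden, direction, alpha):
    """Toy intervention: add alpha * direction to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical templates and word pair (not the paper's actual data).
templates = [
    "I try to be {trait} in every answer.",
    "Respond as someone who is {trait}.",
]
pairs = make_contrast_pairs(templates, "honest", "dishonest")

# Hypothetical hidden state and steering direction.
hidden = [0.5, -1.0, 2.0]
direction = [1.0, 0.0, -1.0]

# Shift in representation caused by each steering coefficient.
shift_pos = [s - h for s, h in zip(steer(hidden, direction, +1.0), hidden)]
shift_neg = [s - h for s, h in zip(steer(hidden, direction, -1.0), hidden)]

# A cosine near -1 means the two shifts point in opposite directions.
score = cosine(shift_pos, shift_neg)
print(pairs[0])
print(round(score, 6))
```

A cosine close to -1 between the two shifts is one simple way to check the antiparallel property on a real model; whether the paper measures it this way is not stated in the abstract.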

Code & Models

Repositories

wassname/AntiPaSTO (GitHub)
