DSPA: Dynamic SAE Steering for Data-Efficient Preference Alignment
James Wedgwood, Aashiq Muhamed, Mona T. Diab, Virginia Smith

TL;DR
DSPA is an inference-time method that improves preference alignment in language models by steering sparse autoencoder (SAE) features during decoding, without updating model weights. It improves MT-Bench scores, stays competitive on AlpacaEval, and remains robust when preference data is limited.
Contribution
We introduce DSPA, an inference-time, prompt-conditional steering method that uses sparse autoencoders for preference alignment, reducing alignment-stage compute and improving mechanistic interpretability.
Findings
DSPA improves performance on MT-Bench and is competitive on AlpacaEval.
DSPA requires up to 4.47 times fewer alignment-stage FLOPs than the two-stage RAHF-SCIT pipeline.
Preference directions are mainly influenced by discourse and stylistic signals.
Abstract
Preference alignment is usually achieved by weight-updating training on preference data, which adds substantial alignment-stage compute and provides limited mechanistic visibility. We propose Dynamic SAE Steering for Preference Alignment (DSPA), an inference-time method that makes sparse autoencoder (SAE) steering prompt-conditional. From preference triples, DSPA computes a conditional-difference map linking prompt features to generation-control features; during decoding, it modifies only token-active latents, without base-model weight updates. Across Gemma-2-2B/9B and Qwen3-8B, DSPA improves MT-Bench and is competitive on AlpacaEval while preserving multiple-choice accuracy. Under restricted preference data, DSPA remains robust and can rival the two-stage RAHF-SCIT pipeline while requiring up to 4.47× fewer alignment-stage FLOPs. Finally, we audit the SAE features DSPA modifies,…
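The abstract's two steps, building a conditional-difference map from preference triples and then editing only token-active latents at decode time, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the averaging rule, the shared feature space for prompts and responses, the `alpha` scale, and the clamp to non-negative values are all assumptions made for illustration, and the inputs are hand-written SAE activation vectors rather than real model activations.

```python
# Toy sketch only. Every name and formula here is an assumption for
# illustration, not the method as specified in the paper.

def conditional_difference_map(prompt_acts, chosen_acts, rejected_acts):
    """Assumed map M[i][j]: mean of (chosen_j - rejected_j) over the
    preference triples whose prompt activates SAE feature i."""
    n_feat = len(prompt_acts[0])
    sums = [[0.0] * n_feat for _ in range(n_feat)]
    counts = [0] * n_feat
    for p, c, r in zip(prompt_acts, chosen_acts, rejected_acts):
        diff = [cj - rj for cj, rj in zip(c, r)]
        for i, p_i in enumerate(p):
            if p_i > 0:  # prompt feature i is active in this triple
                counts[i] += 1
                for j in range(n_feat):
                    sums[i][j] += diff[j]
    return [[s / max(counts[i], 1) for s in sums[i]] for i in range(n_feat)]

def steering_direction(cond_map, prompt_act):
    """Average the map rows of the features active in the new prompt."""
    rows = [cond_map[i] for i, a in enumerate(prompt_act) if a > 0]
    n_feat = len(cond_map[0])
    if not rows:
        return [0.0] * n_feat
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_feat)]

def steer_token_latents(latents, direction, alpha=1.0):
    """Shift only latents already active for this token; inactive
    latents stay at zero. Results are clamped to be non-negative."""
    return [max(z + alpha * d, 0.0) if z > 0 else z
            for z, d in zip(latents, direction)]

# Three toy preference triples over a 2-feature SAE.
prompts  = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
chosen   = [[0.0, 3.0], [0.0, 1.0], [2.0, 0.0]]
rejected = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0]]

cond_map = conditional_difference_map(prompts, chosen, rejected)
direction = steering_direction(cond_map, [0.5, 0.0])
steered = steer_token_latents([0.0, 2.0], direction, alpha=0.5)
```

In this toy setup the steering direction depends on which features the prompt activates, which is the sense in which the steering is prompt-conditional, and a latent that is inactive for the current token is never switched on by the edit.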
Taxonomy
Topics: Constraint Satisfaction and Optimization · Explainable Artificial Intelligence (XAI) · Bayesian Modeling and Causal Inference
