Behavioral Steering in a 35B MoE Language Model via SAE-Decoded Probe Vectors: One Agency Axis, Not Five Traits
Jia Qing Yap

TL;DR
This paper steers a large language model along a single agency axis by training linear probes on sparse autoencoder (SAE) latents and decoding the probe weights into steering vectors, enabling fine-grained control at inference time without retraining.
Contribution
It presents an approach to behavioral steering in a 35B MoE language model using SAE-decoded probe vectors, and finds that the five traits studied modulate a single dominant agency axis rather than five separate ones.
Findings
Autonomy steering at multiplier 2 significantly increases proactive behavior (Cohen's d = 1.01, p < 0.0001).
Behavioral modulation primarily affects a single agency axis.
Steering during autoregressive decoding has no effect.
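For readers unfamiliar with the effect size quoted in the findings, the sketch below computes Cohen's d with a pooled standard deviation. This is one common definition; the summary does not say which variant the paper uses, and the scores below are made-up illustrative numbers, not data from the paper.

```python
import math
from statistics import mean, variance


def cohens_d(treatment, control):
    """Cohen's d using the pooled sample standard deviation.

    Assumption: pooled-SD variant; the paper's exact formula is not
    stated in this summary.
    """
    n1, n2 = len(treatment), len(control)
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / math.sqrt(pooled_var)


# Hypothetical behavior scores (illustrative only, not from the paper).
steered = [3.0, 4.0, 5.0, 4.0]
baseline = [2.0, 3.0, 4.0, 3.0]
d = cohens_d(steered, baseline)
print(f"Cohen's d = {d:.3f}")
```

By the usual rule of thumb, values of d around 0.8 or higher are read as large effects, which is the context for the reported d = 1.01.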
Abstract
We train nine sparse autoencoders (SAEs) on the residual stream of Qwen 3.5-35B-A3B, a 35-billion-parameter Mixture-of-Experts model with a hybrid GatedDeltaNet/attention architecture, and use them to identify and steer five agentic behavioral traits. Our method trains linear probes on SAE latent activations, then projects the probe weights back through the SAE decoder to obtain continuous steering vectors in the model's native activation space. This bypasses the SAE's top-k discretization, enabling fine-grained behavioral intervention at inference time with no retraining. Across 1,800 agent rollouts (50 scenarios × 36 conditions), we find that autonomy steering at multiplier 2 achieves Cohen's d = 1.01 (p < 0.0001), shifting the model from asking the user for help 78% of the time to proactively executing code and searching the web. Cross-trait analysis, however, reveals that all…
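The projection step described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `steering_vector` and `steer`, the decoder layout `(n_latents, d_model)`, the unit-norm scaling, and the additive intervention are all hypothetical choices, and the random arrays stand in for a trained probe and SAE.

```python
import numpy as np


def steering_vector(probe_w, W_dec, normalize=True):
    """Map probe weights over SAE latents into residual-stream space.

    probe_w: (n_latents,) linear-probe weights on SAE latent activations.
    W_dec:   (n_latents, d_model) SAE decoder matrix (assumed layout).
    """
    v = probe_w @ W_dec  # weighted sum of decoder directions
    if normalize:
        norm = np.linalg.norm(v)
        if norm > 0:
            v = v / norm  # assumption: unit-norm direction
    return v


def steer(resid, v, multiplier):
    """Add the scaled steering direction to every residual-stream vector."""
    return resid + multiplier * v


# Toy stand-ins for a trained probe and SAE decoder (random, illustrative).
rng = np.random.default_rng(0)
n_latents, d_model = 64, 16
W_dec = rng.standard_normal((n_latents, d_model))
probe_w = rng.standard_normal(n_latents)

v = steering_vector(probe_w, W_dec)
resid = rng.standard_normal((3, d_model))  # 3 token positions
steered = steer(resid, v, multiplier=2.0)
```

Because the vector is a dense linear combination of decoder rows, the intervention never passes through the SAE encoder's top-k selection, which is the sense in which the abstract says the method bypasses discretization. How the multiplier relates to activation scale in the actual paper is not stated here.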
