SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Zirui He; Mingyu Jin; Bo Shen; Ali Payani; Yongfeng Zhang; Mengnan Du

arXiv:2505.16188 · cs.CL · December 8, 2025

TL;DR

This paper presents SAE-SSV, a supervised steering method in sparse, interpretable latent spaces of language models, enabling more reliable and targeted control over model behaviors with minimal quality loss.

Contribution

Introduces a novel supervised steering approach using sparse autoencoders and linear classifiers to control language model outputs in an interpretable, efficient manner.

Findings

01 Higher success rates in steering tasks

02 Minimal degradation in generation quality

03 Effective control with small subspace dimensions

Abstract

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher…
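
The three stages named in the abstract (sparse latents, a linear classifier that picks a small task-relevant subspace, and a steering vector constrained to that subspace) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes synthetic stand-in latents instead of a real SAE, a plain logistic-regression probe, top-k selection by absolute weight, and a simple L2-regularized objective; the names `signal_dims`, `select_subspace`, and `learn_steering_vector` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    # Clip logits to avoid overflow warnings in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))

def train_logreg(Z, y, lr=0.5, steps=500):
    """Fit a linear probe on latent codes with plain gradient descent."""
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        g = sigmoid(Z @ w + b) - y          # dLoss/dlogit for binary cross-entropy
        w -= lr * (Z.T @ g) / n
        b -= lr * g.mean()
    return w, b

def select_subspace(w, k):
    """Assumption: keep the k latent dimensions with largest |probe weight|."""
    return np.sort(np.argsort(-np.abs(w))[:k])

def learn_steering_vector(Z_src, w, b, dims, lr=0.5, steps=300, l2=0.01):
    """Learn v (nonzero only on `dims`) that pushes source latents toward label 1."""
    mask = np.zeros(Z_src.shape[1])
    mask[dims] = 1.0
    v = np.zeros(Z_src.shape[1])
    for _ in range(steps):
        p = sigmoid((Z_src + v) @ w + b)
        grad = (p - 1.0).mean() * w + l2 * v  # gradient of mean BCE toward target 1, plus L2
        v -= lr * grad * mask                 # masked update keeps v inside the subspace
    return v

# Synthetic stand-in for SAE latents: sparse, non-negative codes (hypothetical data).
n, d, k = 400, 64, 3
signal_dims = np.array([3, 17, 42])           # hypothetical attribute-carrying latents
y = rng.integers(0, 2, size=n).astype(float)
Z = np.abs(rng.normal(size=(n, d))) * (rng.random((n, d)) < 0.1)
Z[:, signal_dims] += 2.0 * y[:, None]

w, b = train_logreg(Z, y)
dims = select_subspace(w, k)
Z_src = Z[y == 0]                             # examples to steer toward the target class
v = learn_steering_vector(Z_src, w, b, dims)

p_before = sigmoid(Z_src @ w + b).mean()
p_after = sigmoid((Z_src + v) @ w + b).mean()
print("selected dims:", dims)
print("mean target probability before/after steering:", p_before, p_after)
```

In an actual pipeline the latents would come from an SAE encoder applied to model activations rather than from random data, and the objective would be tied to the target behavior in generation; both are outside what this sketch shows.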

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications

Methods: ALIGN