Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering
Haiyan Zhao, Xuansheng Wu, Fan Yang, Bo Shen, Ninghao Liu, Mengnan Du

TL;DR
This paper introduces SDCV, a method using sparse autoencoders to denoise concept vectors, significantly enhancing the robustness and success rate of language model steering across diverse datasets.
Contribution
We propose SDCV, a novel sparse-autoencoder-based approach that isolates discriminative signals from noise in concept vectors, improving steering effectiveness.
Findings
Improves steering success rates by 4-16% across six concepts
Maintains topic relevance while denoising
Enhances robustness of language model steering
Abstract
Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
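The denoising step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the ReLU encoder and linear decoder, the scoring rule (absolute difference of mean latent activations), the choice to zero out non-selected latents, and the names `sae_encode`, `sae_decode`, `top_k_discriminative`, `denoised_diff_in_mean`, `k`, and `scale` are all assumptions made for the example. Only the difference-in-mean variant is shown.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # Assumed SAE encoder: linear map followed by ReLU.
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Assumed SAE decoder: linear map back to the hidden-state space.
    return z @ W_dec + b_dec

def top_k_discriminative(z_pos, z_neg, k):
    # Hypothetical scoring rule: latents whose mean activation differs most
    # between positive and negative samples are treated as concept-relevant.
    score = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    return np.argsort(score)[::-1][:k]

def denoised_diff_in_mean(h_pos, h_neg, sae, k=8, scale=2.0):
    W_enc, b_enc, W_dec, b_dec = sae
    z_pos = sae_encode(h_pos, W_enc, b_enc)
    z_neg = sae_encode(h_neg, W_enc, b_enc)
    keep = top_k_discriminative(z_pos, z_neg, k)
    # Assumption: selected latents are scaled up, all others are dropped.
    mask = np.zeros(z_pos.shape[1])
    mask[keep] = scale
    r_pos = sae_decode(z_pos * mask, W_dec, b_dec)
    r_neg = sae_decode(z_neg * mask, W_dec, b_dec)
    # Difference-in-mean concept vector on the reconstructed representations.
    v = r_pos.mean(axis=0) - r_neg.mean(axis=0)
    norm = np.linalg.norm(v)
    return (v / norm if norm > 0 else v), keep

# Toy usage with a random "SAE" and synthetic hidden states.
rng = np.random.default_rng(0)
d, m = 16, 64
sae = (rng.normal(size=(d, m)), np.zeros(m), rng.normal(size=(m, d)), np.zeros(d))
h_neg = rng.normal(size=(32, d))
h_pos = rng.normal(size=(32, d)) + 1.5  # shifted to mimic a concept signal
v, keep = denoised_diff_in_mean(h_pos, h_neg, sae, k=8)
print(v.shape, sorted(keep.tolist()))
```

In a real setting the SAE weights would come from a sparse autoencoder trained on the model's hidden states, and `h_pos` and `h_neg` would be hidden representations of concept-positive and concept-negative samples. For the linear-probing variant, one would presumably fit the probe on the reconstructed representations instead of taking the mean difference.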
Taxonomy
Topics: Topic Modeling · Text and Document Classification Technologies
