Denoising Concept Vectors with Sparse Autoencoders for Improved Language Model Steering

Haiyan Zhao; Xuansheng Wu; Fan Yang; Bo Shen; Ninghao Liu; Mengnan Du

arXiv:2505.15038 · cs.CL · July 31, 2025

TL;DR

This paper introduces SDCV, a method using sparse autoencoders to denoise concept vectors, significantly enhancing the robustness and success rate of language model steering across diverse datasets.

Contribution

We propose SDCV, a novel sparse-autoencoder-based approach that isolates discriminative signals from noise in concept vectors, improving steering effectiveness.

Findings

1. Improves steering success rates by 4-16% across six concepts
2. Maintains topic relevance while denoising
3. Enhances robustness of language model steering

Abstract

Linear concept vectors effectively steer LLMs, but existing methods suffer from noisy features in diverse datasets that undermine steering robustness. We propose Sparse Autoencoder-Denoised Concept Vectors (SDCV), which selectively keeps the most discriminative SAE latents while reconstructing hidden representations. Our key insight is that concept-relevant signals can be explicitly separated from dataset noise by scaling up activations of the top-k latents that best differentiate positive and negative samples. Applied to linear probing and difference-in-mean, SDCV consistently improves steering success rates by 4-16% across six challenging concepts, while maintaining topic relevance.
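The denoising step described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the toy SAE weights, the selection score (absolute difference in mean latent activation between positive and negative samples), the `scale` factor, and all function names are assumptions made for illustration. Latents outside the top-k are zeroed, the kept latents are scaled, the hidden states are reconstructed through the decoder, and a difference-in-mean concept vector is computed on the reconstructions.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc):
    # ReLU encoder of a toy sparse autoencoder (assumed architecture).
    return np.maximum(h @ W_enc + b_enc, 0.0)

def sae_decode(z, W_dec, b_dec):
    # Linear decoder mapping latents back to hidden-state space.
    return z @ W_dec + b_dec

def top_k_discriminative(z_pos, z_neg, k):
    # Assumed score: absolute gap between mean activations of the two classes.
    score = np.abs(z_pos.mean(axis=0) - z_neg.mean(axis=0))
    return np.argsort(score)[-k:]

def denoised_diff_in_means(h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k, scale=1.0):
    z_pos = sae_encode(h_pos, W_enc, b_enc)
    z_neg = sae_encode(h_neg, W_enc, b_enc)
    idx = top_k_discriminative(z_pos, z_neg, k)
    # Keep (and optionally scale up) only the selected latents; zero the rest.
    mask = np.zeros(z_pos.shape[1])
    mask[idx] = scale
    h_pos_dn = sae_decode(z_pos * mask, W_dec, b_dec)
    h_neg_dn = sae_decode(z_neg * mask, W_dec, b_dec)
    # Difference-in-mean concept vector on the denoised representations.
    v = h_pos_dn.mean(axis=0) - h_neg_dn.mean(axis=0)
    return v / (np.linalg.norm(v) + 1e-12), idx

# Toy data: random SAE weights and synthetic hidden states (all hypothetical).
rng = np.random.default_rng(0)
d_model, d_sae, n, k = 16, 64, 200, 4
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)
concept_dir = rng.normal(size=d_model)
h_pos = rng.normal(size=(n, d_model)) + concept_dir  # samples expressing the concept
h_neg = rng.normal(size=(n, d_model))                # samples without it

v, idx = denoised_diff_in_means(h_pos, h_neg, W_enc, b_enc, W_dec, b_dec, k, scale=2.0)
print(v.shape, sorted(idx.tolist()))
```

In this sketch the decoder bias cancels in the difference, so the resulting vector lies entirely in the span of the decoder directions of the selected latents. A trained SAE and real model activations would replace the random weights and synthetic data used here.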

Taxonomy

Topics: Topic Modeling · Text and Document Classification Technologies