AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D. Manning, Christopher Potts

arXiv:2501.17148·cs.CL·March 5, 2025



TL;DR

This paper introduces AxBench, a large-scale benchmark for steering and concept detection in language models. It finds that prompting outperforms all existing steering methods and proposes ReFT-r1 as a competitive, interpretable alternative.

Contribution

The paper presents AxBench, a benchmark for evaluating steering and concept-detection techniques, and introduces ReFT-r1, a novel weakly supervised method that balances performance and interpretability.

Findings

01 For steering, prompting outperforms all existing methods, followed by finetuning.

02 For concept detection, representation-based methods such as difference-in-means perform best.

03 SAEs are not competitive on either evaluation.

Abstract

Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means perform the best. On both evaluations, SAEs are not competitive. We introduce a novel…
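To make the difference-in-means baseline named in the abstract concrete, here is a minimal sketch of the general technique. It is not taken from the AxBench code: the function names, the synthetic "activations", and the dimensions are all illustrative assumptions. The idea is to subtract the mean hidden representation of examples without the concept from the mean of examples with it, then score new representations by projecting onto that direction.

```python
import numpy as np

def diff_in_means_direction(pos_acts, neg_acts):
    """Unit vector from the negative-class mean to the positive-class mean.

    pos_acts, neg_acts: arrays of shape (n_examples, hidden_dim).
    Hypothetical helper, not part of any AxBench API.
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_scores(acts, direction):
    """Project each representation onto the concept direction."""
    return acts @ direction

# Synthetic stand-ins for model activations (assumption: real usage would
# read hidden states from a model such as Gemma-2-2B instead).
rng = np.random.default_rng(0)
dim = 16
concept_axis = np.zeros(dim)
concept_axis[0] = 1.0

pos_acts = rng.normal(size=(200, dim)) + 3.0 * concept_axis  # concept present
neg_acts = rng.normal(size=(200, dim))                       # concept absent

direction = diff_in_means_direction(pos_acts, neg_acts)
pos_scores = concept_scores(pos_acts, direction)
neg_scores = concept_scores(neg_acts, direction)

print("mean score, concept present:", pos_scores.mean())
print("mean score, concept absent: ", neg_scores.mean())
```

On this toy data the recovered direction should align closely with the planted axis, and thresholding the projection gives a simple concept detector. How the benchmark itself extracts activations and evaluates detectors is described in the paper, not here.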


Code & Models

Repositories

stanfordnlp/axbench (PyTorch, official)


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis