Mind the Performance Gap: Capability-Behavior Trade-offs in Feature Steering
Eitan Sprejer, Oscar Agustín Stanchi, María Victoria Carro, Denise Alejandra Mester, Iván Arcuschin

TL;DR
This paper investigates how effectively feature steering controls language model behavior and finds a significant trade-off: successful behavioral control comes with notable degradation in task performance, which calls the method's practicality into question.
Contribution
It provides the first comprehensive empirical evaluation of feature steering's impact on model performance and highlights fundamental capability-behavior trade-offs in mechanistic control methods.
Findings
Feature steering modifies target behaviors effectively.
Task performance drops significantly when steering is used to control behaviors.
Prompting maintains better overall task performance.
Abstract
Feature steering has emerged as a promising approach for controlling LLM behavior through direct manipulation of internal representations, offering advantages over prompt engineering. However, its practical effectiveness in real-world applications remains poorly understood, particularly regarding potential trade-offs with output quality. We show that feature steering methods substantially degrade model performance even when successfully controlling target behaviors, a critical trade-off. Specifically, we evaluate Goodfire's Auto Steer against prompt engineering baselines across 14 steering queries (covering innocuous and safety-relevant behaviors) on 171 Massive Multitask Language Understanding (MMLU) questions using Llama-8B and Llama-70B, measuring accuracy, coherence, and behavioral control. Our findings show that Auto Steer successfully modifies target behaviors (achieving scores of…
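The accuracy comparison described in the abstract (steering versus a prompting baseline on the same questions) can be illustrated with a minimal sketch. The function names, the record layout, and the toy records below are hypothetical and exist only for illustration; they are not the paper's code, data, or results.

```python
from collections import defaultdict


def accuracy_by_method(records):
    """Return accuracy per method from (method, is_correct) pairs.

    The (method, is_correct) record layout is an assumption made
    for this sketch, not a format taken from the paper.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for method, is_correct in records:
        totals[method] += 1
        correct[method] += int(is_correct)
    return {method: correct[method] / totals[method] for method in totals}


def accuracy_gap(records, baseline, treatment):
    """Accuracy of the baseline minus accuracy of the treatment."""
    acc = accuracy_by_method(records)
    return acc[baseline] - acc[treatment]


# Toy placeholder records (NOT results from the paper): each entry marks
# whether one multiple-choice answer was correct under a given method.
toy_records = [
    ("prompting", True), ("prompting", True),
    ("prompting", False), ("prompting", True),
    ("steering", True), ("steering", False),
    ("steering", False), ("steering", True),
]

if __name__ == "__main__":
    print(accuracy_by_method(toy_records))
    print(accuracy_gap(toy_records, baseline="prompting", treatment="steering"))
```

In a full evaluation of the kind the abstract describes, the same aggregation would presumably be repeated per steering query and per model, alongside separate scores for coherence and behavioral control.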
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
