Mechanistic Knobs in LLMs: Retrieving and Steering High-Order Semantic Features via Sparse Autoencoders
Ruikang Zhang, Shuo Wang, Qi Su

TL;DR
This paper introduces a Sparse Autoencoder framework for retrieving and steering high-level semantic features in LLMs, enabling precise control over linguistic behaviors with improved stability.
Contribution
It presents a novel contrastive retrieval method for internal features, demonstrates bidirectional steering of model behavior, and introduces the concept of Functional Faithfulness.
Findings
Enables precise, bidirectional control of linguistic traits in LLMs.
Introduces the concept of Functional Faithfulness for predictable feature interventions.
Outperforms existing activation steering methods like CAA in stability and performance.
Abstract
Recent work in Mechanistic Interpretability (MI) has made it possible to identify internal features in Large Language Models (LLMs) and to intervene on them. However, a persistent challenge lies in linking such internal features to the reliable control of complex, behavior-level semantic attributes in language generation. In this paper, we propose a Sparse Autoencoder-based framework for retrieving and steering semantically interpretable internal features associated with high-level linguistic behaviors. Our method employs a contrastive feature retrieval pipeline based on controlled semantic oppositions, combining statistical activation analysis and generation-based validation to distill monosemantic functional features from sparse activation spaces. Using the Big Five personality traits as a case study, we demonstrate that our method enables precise, bidirectional steering of model behavior…
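To make the two ideas in the abstract concrete, contrastive retrieval and bidirectional steering, here is a minimal sketch on toy data. It is not the authors' pipeline: the abstract does not specify the statistic used, so the standardized mean difference below is an assumption, as are the function names `rank_contrastive_features` and `steer`, the array shapes, and the choice to steer by adding a scaled unit decoder direction to a hidden state. The generation-based validation step mentioned in the abstract is not modeled.

```python
import numpy as np

def rank_contrastive_features(pos_acts, neg_acts, eps=1e-8):
    """Rank SAE features by how strongly they separate two prompt sets.

    pos_acts, neg_acts: arrays of shape (n_prompts, n_features) holding
    SAE feature activations for the two poles of a semantic opposition.
    The score is a standardized mean difference (an assumed statistic).
    """
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    mean_diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    pooled_std = np.sqrt(0.5 * (pos_acts.var(axis=0) + neg_acts.var(axis=0))) + eps
    scores = mean_diff / pooled_std
    order = np.argsort(-scores)  # highest-scoring feature first
    return order, scores

def steer(hidden, decoder, feature_idx, alpha):
    """Shift a hidden state along one feature's unit decoder direction.

    A positive alpha pushes toward the feature, a negative alpha away from it.
    """
    direction = decoder[feature_idx]
    direction = direction / np.linalg.norm(direction)
    return hidden + alpha * direction

# Toy data: 16 hypothetical SAE features over an 8-dimensional hidden state.
rng = np.random.default_rng(0)
n_features, d_model = 16, 8
pos = rng.random((32, n_features)) * 0.1
neg = rng.random((32, n_features)) * 0.1
pos[:, 5] += 1.0  # feature 5 fires only on the positive pole

order, scores = rank_contrastive_features(pos, neg)
top_feature = int(order[0])

decoder = rng.normal(size=(n_features, d_model))  # stand-in decoder matrix
hidden = np.zeros(d_model)
up = steer(hidden, decoder, top_feature, alpha=4.0)
down = steer(hidden, decoder, top_feature, alpha=-4.0)
```

In this toy setup the planted feature ranks first, and the two steered states are mirror images of each other, which is the simplest sense in which an intervention is bidirectional. In a real model the hidden state would come from a specific layer's residual stream and the decoder from a trained SAE.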
