Toward universal steering and monitoring of AI models
Daniel Beaglehole, Adityanarayanan Radhakrishnan, Enric Boix-Adserà, Mikhail Belkin

TL;DR
This paper introduces a scalable method for extracting linear concept representations from large AI models and using them for steering, safety, and capability enhancement. It shows that these representations transfer across human languages and improve monitoring of misaligned content.
Contribution
It presents a novel, scalable approach for extracting linear concept representations in large models, enabling effective steering, transferability, and improved safety monitoring.
Findings
Larger models are more steerable.
Concept representations transfer across languages.
Steering improves model capabilities beyond prompting.
Abstract
Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of…
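To make the idea of a linear concept representation concrete, here is a minimal sketch using synthetic activation vectors. It is not the paper's extraction method: it substitutes a plain difference-of-means direction, and every name in it (`concept_direction`, `concept_score`, `steer`, the synthetic `sample` data, the steering strength `alpha`) is a hypothetical chosen for illustration. Under those assumptions, monitoring amounts to projecting an activation onto the concept direction, and steering amounts to adding a multiple of that direction.

```python
import random


def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_direction(pos, neg):
    """Unit-norm difference of class means.

    A simple stand-in for a learned concept vector, NOT the paper's method.
    """
    diff = [a - b for a, b in zip(mean_vector(pos), mean_vector(neg))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def concept_score(h, direction):
    """Monitoring: project an activation onto the concept direction."""
    return sum(a * b for a, b in zip(h, direction))


def steer(h, direction, alpha):
    """Steering: shift an activation along the concept direction."""
    return [a + alpha * b for a, b in zip(h, direction)]


# Synthetic "activations": the concept shifts coordinate 0 (an assumption
# made purely so the example has a known ground truth).
rng = random.Random(0)
dim = 8


def sample(has_concept):
    shift = 2.0 if has_concept else -2.0
    return [rng.gauss(0, 1) + (shift if i == 0 else 0.0) for i in range(dim)]


pos = [sample(True) for _ in range(200)]
neg = [sample(False) for _ in range(200)]
direction = concept_direction(pos, neg)

h = sample(False)                      # activation without the concept
steered = steer(h, direction, alpha=4.0)

print("score before:", round(concept_score(h, direction), 3))
print("score after: ", round(concept_score(steered, direction), 3))
```

Because `direction` has unit norm, steering with strength `alpha` raises the monitoring score by exactly `alpha`. In a real model the vectors would be hidden states from a chosen layer rather than synthetic samples.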
Taxonomy
Topics: Software Engineering Research · Natural Language Processing Techniques
