Toward universal steering and monitoring of AI models

Daniel Beaglehole; Adityanarayanan Radhakrishnan; Enric Boix-Adserà; Mikhail Belkin

arXiv:2502.03708·cs.CL·May 30, 2025

Open Access

TL;DR

This paper introduces a scalable method for extracting linear concept representations from large AI models and using them to steer model behavior, mitigate misaligned behaviors, and improve capabilities. The representations transfer across human languages and also improve monitoring of misaligned content.

Contribution

The paper presents a scalable approach for extracting linear representations of general concepts from large models (language, vision-language, and reasoning models), enabling effective steering, cross-language transfer, and improved safety monitoring.

Findings

01 Larger models are more steerable.

02 Concept representations transfer across languages.

03 Steering improves model capabilities beyond prompting.

Abstract

Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of…

Taxonomy

Topics: Software Engineering Research · Natural Language Processing Techniques