# Mechanistic interpretability for steering vision-language-action models

**Authors:** Bear Häon, Kaylene Stocking, Ian Chuang, and Claire Tomlin

arXiv: 2509.00328 · 2025-09-03

## TL;DR

This paper introduces a framework for interpreting and steering vision-language-action (VLA) models through their internal transformer activations. It enables real-time, zero-shot control of robot behavior without fine-tuning, reward signals, or environment interaction.

## Contribution

The authors present what they describe as the first framework for interpreting and steering VLA models via their internal representations, allowing direct intervention in model behavior at inference time.

## Key findings

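- Projecting feedforward activations onto the token embedding basis reveals sparse semantic directions, such as speed and direction, that are causally linked to action selection.
- Activation steering modulates behavior in real time, without fine-tuning, reward signals, or environment interaction.
- The method achieves zero-shot behavioral control on two open-source VLAs, Pi0 and OpenVLA, in simulation (LIBERO) and on a physical robot (UR5).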
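
The first finding can be illustrated with a minimal NumPy sketch. It assumes a toy setup that is not taken from the paper: `W_out` holds one feedforward value vector per row, `E` is a token embedding matrix, and the five-word vocabulary and all numbers are invented for illustration. Each value vector is scored against every token embedding, and neurons whose top tokens fall in a chosen keyword set (here, hypothetical speed words) are flagged as candidates for a semantic direction.

```python
import numpy as np

def top_tokens(W_out, E, vocab, k=2):
    """For each FFN value vector (row of W_out), return the k tokens whose
    embeddings score highest against it."""
    logits = W_out @ E.T                       # shape: (n_neurons, vocab_size)
    top = np.argsort(-logits, axis=1)[:, :k]   # k highest-scoring token ids per neuron
    return [[vocab[j] for j in row] for row in top]

def neurons_matching(tokens_per_neuron, keywords):
    """Indices of neurons whose top tokens include at least one keyword."""
    return [i for i, toks in enumerate(tokens_per_neuron) if keywords & set(toks)]

# Toy data (hypothetical, for illustration only).
vocab = ["fast", "slow", "up", "down", "the"]
E = np.array([
    [ 1.0,  0.0, 0.0, 0.0],   # fast
    [-1.0,  0.0, 0.0, 0.0],   # slow
    [ 0.0,  1.0, 0.0, 0.0],   # up
    [ 0.0, -1.0, 0.0, 0.0],   # down
    [ 0.0,  0.0, 0.0, 1.0],   # the
])
W_out = np.array([
    [2.0,  0.3, 0.0, 0.1],    # neuron 0: mostly aligned with "fast"
    [0.0, -3.0, 0.0, 0.1],    # neuron 1: mostly aligned with "down"
])

tokens = top_tokens(W_out, E, vocab)
speed_neurons = neurons_matching(tokens, {"fast", "slow"})
print(tokens)          # top-2 tokens per neuron
print(speed_neurons)   # neurons tagged as speed-related
```

In this toy case only neuron 0 is tagged as speed-related. In a real model the same projection would run over every feedforward layer and the full vocabulary, and most value vectors would not map to a clean concept.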

## Abstract

Vision-Language-Action (VLA) models are a promising path to realizing generalist embodied agents that can quickly adapt to new tasks, modalities, and environments. However, methods for interpreting and steering VLAs fall far short of classical robotics pipelines, which are grounded in explicit models of kinematics, dynamics, and control. This lack of mechanistic insight is a central challenge for deploying learned policies in real-world robotics, where robustness and explainability are critical. Motivated by advances in mechanistic interpretability for large language models, we introduce the first framework for interpreting and steering VLAs via their internal representations, enabling direct intervention in model behavior at inference time. We project feedforward activations within transformer layers onto the token embedding basis, identifying sparse semantic directions - such as speed and direction - that are causally linked to action selection. Leveraging these findings, we introduce a general-purpose activation steering method that modulates behavior in real time, without fine-tuning, reward signals, or environment interaction. We evaluate this method on two recent open-source VLAs, Pi0 and OpenVLA, and demonstrate zero-shot behavioral control in simulation (LIBERO) and on a physical robot (UR5). This work demonstrates that interpretable components of embodied VLAs can be systematically harnessed for control - establishing a new paradigm for transparent and steerable foundation models in robotics.
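
The inference-time steering idea in the abstract can be sketched in a few lines. The setup below is assumed rather than taken from the abstract: a single ReLU feedforward block, hypothetical weight matrices `W_in` and `W_out`, and steering implemented by clamping chosen hidden neurons to a fixed value `alpha`. The paper's actual intervention may differ in detail. The sketch only shows that the output changes without any weight update.

```python
import numpy as np

def ffn_forward(x, W_in, W_out, steer_idx=None, alpha=0.0):
    """Toy feedforward block with optional activation steering.

    x:         (d_model,) input vector
    W_in:      (d_model, d_ff) first projection
    W_out:     (d_ff, d_model) second projection (rows are value vectors)
    steer_idx: hidden neuron indices to clamp (assumed steering mechanism)
    alpha:     value the clamped neurons are set to
    """
    h = np.maximum(x @ W_in, 0.0)   # ReLU hidden activations, shape (d_ff,)
    if steer_idx is not None:
        h[steer_idx] = alpha        # override the selected activations
    return h @ W_out

# Toy weights (hypothetical, for illustration only).
W_in = np.array([[1.0, 0.0, -1.0],
                 [0.0, 1.0,  0.0]])
W_out = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 2.0]])
x = np.array([1.0, 0.0])

base = ffn_forward(x, W_in, W_out)
steered = ffn_forward(x, W_in, W_out, steer_idx=[2], alpha=5.0)
print(base)
print(steered)
```

The hidden activations feed linearly into `W_out`. Clamping a neuron that was inactive therefore shifts the block output by `alpha` times that neuron's value vector. This is why selecting neurons by the tokens their value vectors promote gives a handle on the direction of the change.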

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2509.00328/full.md

## Figures

35 figures with captions in the complete paper: https://tomesphere.com/paper/2509.00328/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/2509.00328/full.md

---
Source: https://tomesphere.com/paper/2509.00328