Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models
Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy III, Mac Schwager

TL;DR
This paper trains sparse autoencoders (SAEs) on the hidden activations of vision-language-action (VLA) models to interpret their internal features. Most features turn out to encode memorized training sequences, while some encode generalizable motion primitives, and steering those features changes robot behavior across tasks.
Contribution
The paper introduces a mechanistic interpretability approach that uses sparse autoencoders to identify and steer generalizable features in VLA models, separating them from features that only reflect memorized demonstrations.
Findings
Most SAE features correspond to memorized training sequences.
Some features represent interpretable, general motion primitives.
Steering features causally influences robot behavior across tasks.
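The steering finding can be illustrated with a small sketch. This summary does not describe the paper's actual steering procedure, so the code below assumes one common approach: adding a scaled, unit-normalized SAE decoder direction to a hidden activation. The function name `steer_activation`, the scale `alpha`, and the array sizes are all hypothetical.

```python
import numpy as np


def steer_activation(h, decoder_direction, alpha):
    """Shift hidden activations along one feature direction.

    Assumption: steering is done by adding alpha times the unit-normalized
    decoder vector of the chosen feature. The paper may use a different rule.
    """
    unit = decoder_direction / np.linalg.norm(decoder_direction)
    return h + alpha * unit


rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))       # hypothetical batch of hidden activations
direction = rng.normal(size=16)    # hypothetical decoder row for one feature
h_steered = steer_activation(h, direction, alpha=3.0)
```

Under this assumption, each activation's projection onto the feature direction grows by exactly `alpha`, and components orthogonal to that direction are unchanged.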
Abstract
Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, their generalization is inconsistent: while these models can perform impressively in some settings, fine-tuned variants often fail on novel objects, scenes, and instructions. We apply mechanistic interpretability techniques to better understand the inner workings of VLA models. To probe internal representations, we train Sparse Autoencoders (SAEs) on hidden layer activations of the VLA. SAEs learn a sparse dictionary whose features act as a compact, interpretable basis for the model's computation. We find that the large majority of extracted SAE features correspond to memorized sequences from specific training demonstrations. However, some features correspond to interpretable, general, and steerable motion primitives and semantic properties, offering a promising…
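As a rough illustration of what training an SAE on hidden activations involves, here is a minimal sketch. It assumes a standard single-layer ReLU autoencoder with an L1 sparsity penalty; the abstract does not specify the architecture, so the dimensions, parameter names, and `l1_coeff` value are hypothetical, and random arrays stand in for real VLA activations.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_dict = 16, 64  # hypothetical: activation width and dictionary size

# Hypothetical, randomly initialized parameters (training loop omitted).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))
b_dec = np.zeros(d_model)


def sae_encode(h):
    """Map activations to non-negative feature coefficients (ReLU)."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)


def sae_decode(f):
    """Reconstruct activations as a weighted sum of dictionary rows."""
    return f @ W_dec + b_dec


def sae_loss(h, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty that encourages sparsity."""
    f = sae_encode(h)
    h_hat = sae_decode(f)
    recon = np.mean(np.sum((h - h_hat) ** 2, axis=-1))
    sparsity = np.mean(np.sum(np.abs(f), axis=-1))
    return recon + l1_coeff * sparsity


h = rng.normal(size=(8, d_model))  # stand-in for VLA hidden activations
features = sae_encode(h)
loss = sae_loss(h)
```

Each row of `W_dec` plays the role of one dictionary feature. In a trained SAE, only a few entries of `features` would be nonzero for a given activation, which is what makes individual features candidates for interpretation.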
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Robot Manipulation and Learning
