Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models
Bryce Grant, Xijia Zhao, Peng Wang

TL;DR
This study investigates how vision-language-action models translate multimodal inputs into actions. Across architectures, it finds that the visual pathway dominates action generation, that the role of language depends on task structure, and that motor programs and goal semantics are encoded in separate pathways.
Contribution
The paper provides a mechanistic analysis of VLA models, highlighting the visual pathway's dominance, the dependence of language sensitivity on task structure, and the separation of motor and goal representations. The analysis covers six models and more than 394,000 rollout episodes on four benchmarks.
Findings
Visual pathway dominates action generation across architectures.
Language sensitivity depends on task structure, not model design.
Expert pathways encode motor programs; VLM pathways encode goal semantics.
Abstract
Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M to 7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is…
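To make the activation-injection idea in the abstract concrete, here is a minimal sketch on a toy two-stage policy. It is not the authors' code: the network, the weights `W_HIDDEN` and `W_ACTION`, and the function names are all hypothetical, and the "null prompt" is assumed to be an all-zero language vector. The sketch records hidden activations from a baseline run and overwrites the hidden activations of a null-prompt run with them.

```python
import math

# Hypothetical hand-picked weights for a toy policy (not from the paper).
# Hidden layer reads 2 vision features followed by 2 language features.
W_HIDDEN = [
    [0.4, -0.3, 0.8, 0.1],
    [0.2, 0.5, -0.6, 0.7],
    [-0.5, 0.1, 0.3, -0.4],
]
# Action head maps 3 hidden units to a 2-D action.
W_ACTION = [
    [0.9, -0.4, 0.2],
    [-0.3, 0.6, 0.5],
]


def layer(weights, inputs):
    """Dense layer with tanh nonlinearity: one output per weight row."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(vision, language, injected_hidden=None):
    """Run the toy policy and return (hidden, action).

    If injected_hidden is given, it replaces the hidden activations
    computed from the inputs before the action head runs.
    """
    hidden = layer(W_HIDDEN, vision + language)
    if injected_hidden is not None:
        hidden = list(injected_hidden)  # the injection step
    action = layer(W_ACTION, hidden)
    return hidden, action


vision = [0.5, -0.2]
prompt = [1.0, 0.3]       # assumed encoding of a real instruction
null_prompt = [0.0, 0.0]  # assumed encoding of an empty instruction

# 1. Baseline run: record the hidden activations.
baseline_hidden, baseline_action = forward(vision, prompt)
# 2. Null-prompt run without intervention.
_, null_action = forward(vision, null_prompt)
# 3. Null-prompt run with the baseline activations injected.
_, patched_action = forward(vision, null_prompt, injected_hidden=baseline_hidden)

print("baseline:", baseline_action)
print("null    :", null_action)
print("patched :", patched_action)
```

In this toy, the injected layer is the only input to the action head, so the patched run reproduces the baseline action exactly by construction. In a real VLA model, injection targets selected layers or pathways while the rest of the computation still sees the null prompt, so how much behavior is recovered is an empirical question.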
Taxonomy
Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Robot Manipulation and Learning
