Not All Features Are Created Equal: A Mechanistic Study of Vision-Language-Action Models

Bryce Grant; Xijia Zhao; Peng Wang

arXiv:2603.19233·cs.RO·March 20, 2026

Open Access

TL;DR

This study investigates how vision-language-action (VLA) models translate multimodal inputs into actions. It finds that the visual pathway dominates action generation, that the role of language depends on task structure, and that motor programs and goal semantics are encoded in separate pathways (expert and VLM, respectively).

Contribution

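The paper provides a mechanistic analysis of VLA models using activation injection, sparse autoencoders, and linear probes. It shows that the visual pathway dominates action generation, that language sensitivity depends on task structure, and that motor and goal representations are separated. The analysis covers six models and more than 394,000 rollout episodes on four benchmarks.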

Findings

01. Visual pathway dominates action generation across architectures.

02. Language sensitivity depends on task structure, not model design.

03. Expert pathways encode motor programs; VLM pathways encode goal semantics.

Abstract

Vision-Language-Action (VLA) models combine perception, language, and motor control in a single architecture, yet how they translate multimodal inputs into actions remains poorly understood. We apply activation injection, sparse autoencoders (SAEs), and linear probes to six models spanning 80M–7B parameters across 394,000+ rollout episodes on four benchmarks. The visual pathway dominates action generation across all architectures: injecting baseline activations into null-prompt episodes recovers near-identical behavior, while cross-task injection steers robots toward source-task positions (99.8% of X-VLA episodes align with the source trajectory), exposing spatially bound motor programs tied to scene coordinates rather than abstract task representations. Language sensitivity depends on task structure, not model design: when visual context uniquely specifies the task, language is…
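
The activation-injection idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' code: `ToyPolicy`, its weights, and the `hidden_override` parameter are hypothetical names invented for this example, and a two-layer network in plain Python stands in for a real VLA model. The sketch caches the hidden activations from a baseline run (with a prompt) and injects them into a run with a null prompt. Because the whole hidden layer is overridden here, the action is recovered exactly; in a real model, where only some layers or pathways are patched, one would expect approximate recovery at best.

```python
from typing import List, Optional, Tuple

Vector = List[float]


def linear(weights: List[Vector], x: Vector) -> Vector:
    """Matrix-vector product for a weight matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


class ToyPolicy:
    """Hypothetical stand-in for a VLA model (assumption, not from the paper).

    hidden = relu(W1 @ [visual; language]); action = W2 @ hidden
    """

    def __init__(self) -> None:
        # Arbitrary fixed weights chosen only for illustration.
        self.w1 = [
            [0.5, -0.2, 0.1, 0.3],
            [0.4, 0.9, -0.6, 0.2],
            [-0.3, 0.8, 0.7, -0.5],
        ]
        self.w2 = [
            [1.0, -0.5, 0.25],
            [0.3, 0.6, -0.9],
        ]

    def forward(
        self,
        visual: Vector,
        language: Vector,
        hidden_override: Optional[Vector] = None,
    ) -> Tuple[Vector, Vector]:
        """Return (action, hidden). If hidden_override is given, it replaces
        the computed hidden activations before the action head runs."""
        hidden = relu(linear(self.w1, visual + language))
        if hidden_override is not None:
            hidden = list(hidden_override)  # the injection step
        action = linear(self.w2, hidden)
        return action, hidden


policy = ToyPolicy()
visual = [1.0, 0.5]        # toy "scene" features
prompt = [0.8, -0.4]       # toy "instruction" features
null_prompt = [0.0, 0.0]   # empty instruction

# 1. Baseline run: record the hidden activations.
baseline_action, baseline_hidden = policy.forward(visual, prompt)

# 2. Null-prompt run without injection: behavior changes.
null_action, _ = policy.forward(visual, null_prompt)

# 3. Null-prompt run with the cached baseline activations injected.
injected_action, _ = policy.forward(
    visual, null_prompt, hidden_override=baseline_hidden
)

print("baseline:", baseline_action)
print("null:    ", null_action)
print("injected:", injected_action)
```

In a framework such as PyTorch, the same pattern is usually implemented with forward hooks on the chosen layer rather than an explicit override argument; which layers and pathways the paper patches is not specified in the text above.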

Taxonomy

Topics: Multimodal Machine Learning Applications · Action Observation and Synchronization · Robot Manipulation and Learning