VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

TL;DR
This paper introduces VLM4VLA, a simple yet effective pipeline for converting vision-language models into vision-language-action policies, revealing that VLM capabilities alone do not predict downstream performance and highlighting the importance of visual modules.
Contribution
The paper presents VLM4VLA, a minimal method for adapting VLMs into VLA policies, and provides extensive empirical analysis of the factors that affect downstream embodied control performance.
Findings
VLM initialization benefits downstream tasks over training from scratch.
VLM general capabilities poorly predict downstream control performance.
Visual modules are the main bottleneck in VLM-based embodied policies.
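One way to make the second finding concrete is to check whether ranking models by a general benchmark score also ranks them by control success. The sketch below computes a Spearman rank correlation in plain Python. The function names and the example numbers are illustrative assumptions, not the paper's analysis code or its results.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Made-up placeholder numbers, NOT results from the paper.
benchmark_scores = [60.0, 65.0, 70.0, 75.0]  # hypothetical general VLM scores
control_success = [0.5, 0.8, 0.4, 0.6]       # hypothetical task success rates
rho = spearman(benchmark_scores, control_success)
```

A value of `rho` near 1 would mean the general benchmark orders models the same way control success does. A value near 0 would mean the benchmark says little about which model to pick.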
Abstract
Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of…
Peer Reviews
Decision: ICLR 2026 Poster
1. A meta-analysis of the role of VLMs in VLA models, showing competitive performance with a simplified training framework. 2. A study of the relationship between robot task performance and generic VQA performance. 3. A study of the (lack of) usefulness of VQA data extracted from robot data. 4. An analysis of the importance of fine-tuning the vision encoder, likely due to the domain mismatch between pretraining and robot data.
1. While the experiments show that some VLMs can achieve competitive performance in the VLM4VLA framework, the paper lacks specific guidelines and insights into why specific models perform better. While the authors found VQA performance to be predictive of CALVIN performance, this was not the case for the other benchmarks. This left me wondering how I would choose the next VLM to initialize my VLA from. 2. I am also concerned about the decision of using the same hyperparameters for all the model
1. The paper addresses the important problem of the impact of VLMs on VLA performance, which has been relatively understudied in prior work. 2. The paper shows that general VLM capability does not necessarily correlate with VLA capability. This is an important finding since it contradicts the common intuition that a stronger VLM model is always better for VLAs. For example, recent VLA works use newer VLM bases, which this study shows is not necessarily a good decision. 3. The paper supports its
1. The importance of the visual encoder can also be explained by several other factors beyond the need to finetune it. (1) The Qwen2.5-VL model is sensitive to image resolution, with higher resolutions using more visual tokens per image and typically producing better performance. It is possible that the visual encoder could be frozen if the image resolution were increased. (2) The VLMs are trained primarily on real images, while the selected benchmarks are not photorealistic and use only simple
1. The paper presents a framework for fairly comparing the performance of different VLMs on VLA tasks and provides an in-depth study into the reasons for performance discrepancies. 2. By using an MLP action head instead of a more complex diffusion-based one, the framework avoids introducing stochasticity. This ensures a "fair and reproducible" comparison across the different VLMs. 3. It systematically proposes three benchmarks for evaluating VLM capabilities: general capability, embodied-specifi
1. The study lacks real-robot experiments. The sim-to-real gap is a major concern in the VLA field, and this work doesn't clarify how different VLMs might affect the model's final generalization to real-world scenarios. 2. Diffusion action heads and MLP action heads may leverage VLM capabilities differently (e.g., many diffusion heads use the VLM's KV-cache for information interaction). The paper does not directly compare the impact of these two approaches on VLA performance.
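The reviews above contrast a deterministic MLP action head with diffusion-based heads. As a rough illustration of the deterministic option, here is a minimal two-layer MLP head in plain Python that maps a pooled backbone feature vector to an action vector. The class name, layer sizes, initialization, and the 7-dimensional action are all assumptions made for the sketch, and the paper's actual head may differ.

```python
import math
import random


class MLPActionHead:
    """Hypothetical two-layer MLP: backbone feature -> continuous action."""

    def __init__(self, in_dim, hidden_dim, action_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Small uniform init scaled by fan-in (an illustrative choice).
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1 = init(hidden_dim, in_dim)
        self.b1 = [0.0] * hidden_dim
        self.w2 = init(action_dim, hidden_dim)
        self.b2 = [0.0] * action_dim

    def forward(self, feature):
        # Hidden layer with ReLU, then a linear output layer.
        hidden = [max(0.0, sum(w * x for w, x in zip(row, feature)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


def mse_loss(pred, target):
    """Mean squared error between predicted and target actions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Stand-in for a pooled VLM hidden state (hypothetical size 8).
feature = [0.1] * 8
head = MLPActionHead(in_dim=8, hidden_dim=16, action_dim=7)
action = head.forward(feature)
```

Because `forward` involves no sampling, the same feature always yields the same action. That is the property one reviewer credits for making comparisons across VLMs reproducible, whereas a diffusion head would draw noise at inference time.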
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Neurobiology of Language and Bilingualism
