VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang; Xiaoyu Chen; Qiuyue Wang; Mingsheng Li; Yanjiang Guo; Yucheng Hu; Jiajun Zhang; Shuai Bai; Junyang Lin; Jianyu Chen

arXiv:2601.03309 · cs.CV · January 8, 2026

Open Access 3 Reviews

TL;DR

This paper introduces VLM4VLA, a simple yet effective pipeline for converting vision-language models into vision-language-action policies, revealing that VLM capabilities alone do not predict downstream performance and highlighting the importance of visual modules.

Contribution

The paper presents VLM4VLA, a minimal adaptation method for VLMs into VLA policies, and provides extensive empirical analysis on factors affecting downstream embodied control performance.

Findings

01

VLM initialization benefits downstream tasks over training from scratch.

02

VLM general capabilities poorly predict downstream control performance.

03

Visual modules are the main bottleneck in VLM-based embodied policies.
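One way to make finding 02 concrete is to rank-correlate a VLM benchmark score with a downstream control metric across several backbones. The sketch below computes a Spearman rank correlation using only the standard library. The function names and every number in `vqa_scores` and `control_success` are hypothetical placeholders chosen for illustration, not results from the paper.

```python
from math import sqrt


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder values, one entry per VLM backbone.
vqa_scores = [61.0, 68.5, 72.3, 75.1, 79.8]
control_success = [0.52, 0.47, 0.58, 0.44, 0.50]

rho = spearman(vqa_scores, control_success)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near zero (or negative) across backbones would match the claim that general capability is a poor predictor. A value near 1 would contradict it.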

Abstract

Vision-Language-Action (VLA) models, which integrate pretrained large Vision-Language Models (VLMs) into their policy backbone, are gaining significant attention for their promising generalization capabilities. This paper revisits a fundamental yet seldom systematically studied question: how do VLM choice and competence translate to downstream VLA policy performance? We introduce VLM4VLA, a minimal adaptation pipeline that converts general-purpose VLMs into VLA policies using only a small set of new learnable parameters for fair and efficient comparison. Despite its simplicity, VLM4VLA proves surprisingly competitive with more sophisticated network designs. Through extensive empirical studies on various downstream tasks across three benchmarks, we find that while VLM initialization offers a consistent benefit over training from scratch, a VLM's general capabilities are poor predictors of…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. A meta-analysis of the role of VLMs in VLA models, showing competitive performance with a simplified training framework.
2. A study of the relationship between robot task performance and generic VQA performance.
3. A study of the (lack of) usefulness of VQA data extracted from robot data.
4. An analysis of the importance of fine-tuning the vision encoder, likely due to the domain mismatch between pretraining and robot data.

Weaknesses

1. While the experiments show that some VLMs can achieve competitive performance in the VLM4VLA framework, the paper lacks specific guidelines and insights into why specific models perform better. While the authors found VQA performance to be predictive of CALVIN performance, this was not the case for the other benchmarks. This left me wondering how I would choose the next VLM to initialize my VLA from.
2. I am also concerned about the decision of using the same hyperparameters for all the model

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper addresses the important problem of the impact of VLMs on VLA performance, which has been relatively understudied in prior work.
2. The paper shows that general VLM capability does not necessarily correlate with VLA capability. This is an important finding since it contradicts the common intuition that a stronger VLM is always better for VLAs. For example, recent VLA works use newer VLM bases, which this study shows is not necessarily a good decision.
3. The paper supports its

Weaknesses

1. The importance of the visual encoder can also be explained by several other factors beyond the need to finetune it. (1) The Qwen2.5-VL model is sensitive to image resolution, with higher resolutions using more visual tokens per image and typically producing better performance. It is possible that the visual encoder could be frozen if the image resolution were increased. (2) The VLMs are trained primarily on real images, while the selected benchmarks are not photorealistic and use only simple

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The paper presents a framework for fairly comparing the performance of different VLMs on VLA tasks and provides an in-depth study into the reasons for performance discrepancies.
2. By using an MLP action head instead of a more complex diffusion-based one, the framework avoids introducing stochasticity. This ensures a "fair and reproducible" comparison across the different VLMs.
3. It systematically proposes three benchmarks for evaluating VLM capabilities: general capability, embodied-specifi

Weaknesses

1. The study lacks real-robot experiments. The sim-to-real gap is a major concern in the VLA field, and this work doesn't clarify how different VLMs might affect the model's final generalization to real-world scenarios.
2. Diffusion action heads and MLP action heads may leverage VLM capabilities differently (e.g., many diffusion heads use the VLM's KV-cache for information interaction). The paper does not directly compare the impact of these two approaches on VLA performance.
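To illustrate Reviewer 03's point that an MLP action head is deterministic, here is a minimal, dependency-free sketch of a two-layer MLP that maps one hidden-state vector to a continuous action. It is not the paper's implementation. The class name `MLPActionHead`, the layer sizes, the 7-dimensional action, and the idea of reading a single VLM hidden state are all assumptions made for illustration.

```python
import math
import random


class MLPActionHead:
    """Two-layer MLP: hidden state -> ReLU features -> continuous action.

    Hypothetical sketch; in a real policy these weights would be the
    small set of new learnable parameters trained on robot data.
    """

    def __init__(self, hidden_dim, mlp_dim, action_dim, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            # Uniform init scaled by fan-in, a common default choice.
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)]
                    for _ in range(rows)]

        self.w1 = init(mlp_dim, hidden_dim)
        self.b1 = [0.0] * mlp_dim
        self.w2 = init(action_dim, mlp_dim)
        self.b2 = [0.0] * action_dim

    def __call__(self, hidden_state):
        # First layer with ReLU non-linearity.
        features = [
            max(0.0, sum(w * x for w, x in zip(row, hidden_state)) + b)
            for row, b in zip(self.w1, self.b1)
        ]
        # Second layer is linear: one output per action dimension.
        return [
            sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(self.w2, self.b2)
        ]


# Stand-in for a VLM's output hidden state (assumed 16-dimensional here).
hidden_state = [0.1] * 16
head = MLPActionHead(hidden_dim=16, mlp_dim=32, action_dim=7)

action = head(hidden_state)
# The same input always yields the same action: no sampling is involved.
print(action == head(hidden_state))
```

A diffusion head would instead draw noise and denoise it over several steps, so repeated calls on the same input can differ unless the random seed is fixed.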

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Neurobiology of Language and Bilingualism