How Do VLAs Effectively Inherit from VLMs?

Chuheng Zhang; Rushuai Yang; Xiaoyu Chen; Kaixin Wang; Li Zhao; Yi Chen; Jiang Bian

arXiv:2511.06619 · cs.RO · November 11, 2025

Open Access

TL;DR

This paper introduces GrinningFace, a diagnostic benchmark for evaluating how effectively vision-language-action models inherit knowledge from vision-language models, highlighting the importance of knowledge transfer techniques for embodied AI.

Contribution

The paper presents a novel emoji tabletop manipulation benchmark and systematically evaluates various knowledge transfer methods for VLA models.

Findings

01. Preserving VLM priors enhances generalization in embodied control.
02. Parameter-efficient fine-tuning improves knowledge transfer.
03. Emoji-based tasks reveal the effectiveness of different transfer techniques.

Abstract

Vision-language-action (VLA) models hold the promise to attain generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, the fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this critical question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task where the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing -- knowledge associated with emojis is ubiquitous in Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied…
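The abstract treats successful placement on the instructed emoji as the signal that VLM priors transferred. A minimal sketch of how such a success rate could be scored is given here. The `Episode` fields, the 5 cm tolerance, and the nearest-emoji check are illustrative assumptions, not the benchmark's actual success criterion, which this excerpt does not specify.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float]  # (x, y) table coordinates in metres (assumed units)


@dataclass
class Episode:
    """One hypothetical rollout; field names are illustrative only."""
    instruction_emoji: str            # emoji named by the language instruction
    emoji_positions: Dict[str, Point] # printed emoji name -> position on the table
    final_object_xy: Point            # where the object came to rest


def is_success(ep: Episode, tolerance: float = 0.05) -> bool:
    """Assumed criterion: the object rests within `tolerance` of the
    instructed emoji and is closer to it than to any distractor emoji."""
    ox, oy = ep.final_object_xy
    dist = {
        name: hypot(ox - x, oy - y)
        for name, (x, y) in ep.emoji_positions.items()
    }
    target = ep.instruction_emoji
    nearest = min(dist, key=dist.get)
    return dist[target] <= tolerance and nearest == target


def success_rate(episodes: Iterable[Episode]) -> float:
    """Fraction of episodes that satisfy the assumed success criterion."""
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(is_success(ep) for ep in episodes) / len(episodes)


if __name__ == "__main__":
    layout = {"grinning_face": (0.0, 0.0), "red_heart": (0.2, 0.0)}
    episodes = [
        Episode("grinning_face", layout, (0.01, 0.02)),  # lands on the target
        Episode("red_heart", layout, (0.0, 0.0)),        # lands on a distractor
    ]
    print(success_rate(episodes))
```

Requiring the target to be the nearest emoji, and not merely within tolerance, keeps a placement between two closely spaced emojis from counting for the wrong one. This matters because the task is meant to test whether the policy recognizes the instructed emoji.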

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization