How Do VLAs Effectively Inherit from VLMs?
Chuheng Zhang, Rushuai Yang, Xiaoyu Chen, Kaixin Wang, Li Zhao, Yi Chen, Jiang Bian

TL;DR
This paper introduces GrinningFace, a diagnostic benchmark for evaluating how effectively vision-language-action models inherit knowledge from vision-language models, highlighting the importance of knowledge transfer techniques for embodied AI.
Contribution
The paper presents a novel emoji tabletop manipulation benchmark and systematically evaluates various knowledge transfer methods for VLA models.
Findings
Preserving VLM priors enhances generalization in embodied control.
Parameter-efficient fine-tuning improves knowledge transfer.
Emoji-based tasks reveal the effectiveness of different transfer techniques.
Abstract
Vision-language-action (VLA) models hold the promise of attaining generalizable embodied control. To achieve this, a pervasive paradigm is to leverage the rich vision-semantic priors of large vision-language models (VLMs). However, a fundamental question persists: How do VLAs effectively inherit the prior knowledge from VLMs? To address this question, we introduce a diagnostic benchmark, GrinningFace, an emoji tabletop manipulation task in which the robot arm is asked to place objects onto printed emojis corresponding to language instructions. This task design is particularly revealing: knowledge associated with emojis is ubiquitous in Internet-scale datasets used for VLM pre-training, yet emojis themselves are largely absent from standard robotics datasets. Consequently, they provide a clean proxy: successful task completion indicates effective transfer of VLM priors to embodied…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Action Observation and Synchronization
