Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions

Wei Zhao; Gongsheng Li; Zhefei Gong; Pengxiang Ding; Han Zhao; Donglin Wang

arXiv:2505.11214·cs.RO·May 19, 2025

Open Access

TL;DR

This paper introduces OE-VLA, a vision-language-action model that understands and executes open-ended multimodal instructions, extending human-robot interaction beyond language-only prompts.

Contribution

The paper presents OE-VLA, a VLA model that accepts multimodal instructions rather than language alone. The abstract's motivating examples are an image of the object to retrieve, an instruction written on a whiteboard, and a behavior demonstrated in a video.

Findings

01

OE-VLA performs comparably to language-only VLA models.

02

OE-VLA excels in four additional open-ended task categories.

03

The approach broadens human-robot interaction applications.

Abstract

Vision-Language-Action (VLA) models have recently become highly prominent in the field of robotics. Leveraging vision-language foundation models trained on large-scale internet data, the VLA model can generate robotic actions directly from visual observations and human instructions through a single end-to-end neural network. Despite their effectiveness, current VLA models usually accept only one form of human prompting, language instructions, which may constrain their applicability in open-ended human-robot interactions. For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on the whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions. To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions.…

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Advanced Neural Network Applications