Unveiling the Potential of Vision-Language-Action Models with Open-Ended Multimodal Instructions
Wei Zhao, Gongsheng Li, Zhefei Gong, Pengxiang Ding, Han Zhao, Donglin Wang

TL;DR
This paper introduces OE-VLA, a vision-language-action model capable of understanding and executing open-ended multimodal instructions, significantly broadening human-robot interaction capabilities beyond language-only prompts.
Contribution
The paper presents OE-VLA, a novel VLA model that handles diverse multimodal instructions, expanding the scope of robotic understanding and interaction in real-world scenarios.
Findings
OE-VLA achieves performance comparable to that of language-only VLA models.
OE-VLA excels in four additional open-ended task categories.
The approach broadens human-robot interaction applications.
Abstract
Vision-Language-Action (VLA) models have recently become highly prominent in the field of robotics. Leveraging vision-language foundation models trained on large-scale internet data, the VLA model can generate robotic actions directly from visual observations and human instructions through a single end-to-end neural network. Despite their effectiveness, current VLA models usually accept only one form of human prompting, language instructions, which may constrain their applicability in open-ended human-robot interactions. For example, a user might expect the robot to retrieve an object shown in an image, follow an instruction written on the whiteboard, or imitate a behavior demonstrated in a video, rather than relying solely on language-based descriptions. To address this gap, we introduce OE-VLA, which explores the potential of VLA models for open-ended multimodal instructions.…
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Advanced Neural Network Applications
