VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
Wei Zhao, Pengxiang Ding, Min Zhang, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang

TL;DR
VLAS is an end-to-end vision-language-action model that integrates speech recognition directly into the robot policy model, so robots can follow natural spoken commands and support personalized interactions.
Contribution
The paper introduces VLAS, a novel model that combines speech recognition with vision-language-action capabilities for robots, along with new datasets and a retrieval-augmented generation approach.
Findings
VLAS effectively handles diverse speech commands in robot manipulation.
The model supports multimodal interaction across text, image, speech, and actions.
Experiments demonstrate improved performance and natural interaction.
Abstract
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure loses non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Automated Systems · Robot Manipulation and Learning
