10 Open Challenges Steering the Future of Vision-Language-Action Models
Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu

TL;DR
This paper reviews ten key challenges and emerging trends in the development of vision-language-action models, aiming to guide future research in embodied AI.
Contribution
It provides a comprehensive overview of principal milestones and future directions for advancing VLA models in embodied AI.
Findings
Identification of 10 principal milestones in VLA development
Discussion of emerging trends like spatial understanding and data synthesis
Highlighting research avenues to accelerate VLA progress
Abstract
Owing to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis, all aimed at reaching these milestones. Through these discussions, we hope to draw attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.
Taxonomy
Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Language and cultural evolution
