10 Open Challenges Steering the Future of Vision-Language-Action Models

Soujanya Poria; Navonil Majumder; Chia-Yu Hung; Amir Ali Bagherzadeh; Chuan Li; Kenneth Kwok; Ziwei Wang; Cheston Tan; Jiajun Wu; David Hsu

arXiv:2511.05936·cs.RO·November 11, 2025

TL;DR

This paper reviews ten key challenges and emerging trends in the development of vision-language-action models, aiming to guide future research in embodied AI.

Contribution

The paper surveys 10 principal milestones in VLA development and the emerging trends aimed at reaching them, outlining future directions for embodied AI.

Findings

01. Identification of 10 principal milestones in VLA development

02. Discussion of emerging trends like spatial understanding and data synthesis

03. Highlighting research avenues to accelerate VLA progress

Abstract

Owing to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in embodied AI, following the widespread success of their precursors, LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models: multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of using spatial understanding, modeling world dynamics, post-training, and data synthesis, all aimed at reaching these milestones. Through these discussions, we hope to draw attention to the research avenues that may accelerate the development of VLA models toward wider acceptance.

Taxonomy

Topics: Multimodal Machine Learning Applications · Language, Metaphor, and Cognition · Language and cultural evolution