Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Ziyao Wang, Bingying Wang, Hanrong Zhang, Tingting Du, Tianyang Chen, Guoheng Sun, Yexiao He, Zheyu Shen, Wanghao Ye, Ang Li

TL;DR
This survey emphasizes the importance of data infrastructure, including datasets, benchmarks, and data engines, in advancing Vision-Language-Action models for robotics, highlighting current limitations and future challenges.
Contribution
It provides a systematic, data-centric analysis of VLA research, categorizing datasets, benchmarks, and data engines, and identifies key open challenges for future progress.
Findings
Identifies a persistent fidelity-cost trade-off in datasets that constrains large-scale collection.
Exposes gaps in current benchmarks for compositional generalization and reasoning.
Highlights limitations of current data engines in physical grounding and sim-to-real transfer.
Abstract
Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon…
