Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

Ziyao Wang; Bingying Wang; Hanrong Zhang; Tingting Du; Tianyang Chen; Guoheng Sun; Yexiao He; Zheyu Shen; Wanghao Ye; Ang Li

arXiv:2604.23001·cs.RO·April 28, 2026

TL;DR

This survey emphasizes the importance of data infrastructure, including datasets, benchmarks, and data engines, in advancing Vision-Language-Action models for robotics, highlighting current limitations and future challenges.

Contribution

The survey provides a systematic, data-centric analysis of VLA research organized around datasets, benchmarks, and data engines, and identifies key open challenges for future progress.

Findings

01 Identifies a fidelity-cost trade-off in datasets that constrains large-scale collection.

02 Exposes gaps in current benchmarks for compositional generalization and reasoning.

03 Highlights limitations of current data engines in physical grounding and sim-to-real transfer.

Abstract

Despite remarkable progress in Vision-Language-Action (VLA) models, a central bottleneck remains underexamined: the data infrastructure that underlies embodied learning. In this survey, we argue that future advances in VLA will depend less on model architecture and more on the co-design of high-fidelity data engines and structured evaluation protocols. To this end, we present a systematic, data-centric analysis of VLA research organized around three pillars: datasets, benchmarks, and data engines. For datasets, we categorize real-world and synthetic corpora along embodiment diversity, modality composition, and action space formulation, revealing a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. For benchmarks, we analyze task complexity and environment structure jointly, exposing structural gaps in compositional generalization and long-horizon…
