A tutorial note on collecting simulated data for vision-language-action models
Heran Wu, Zirun Zhou, Jingfeng Zhang

TL;DR
This paper reviews methods for generating and utilizing high-quality simulated datasets to train vision-language-action models in robotics, emphasizing simulation, benchmarking, and large-scale data collection.
Contribution
It reviews three representative systems for training VLA models: PyBullet for data generation, LIBERO for standardized task definition and evaluation, and RT-X for large-scale multi-robot data collection.
Findings
PyBullet enables flexible custom data simulation.
LIBERO provides standardized task benchmarks.
RT-X facilitates large-scale multi-robot data collection.
Abstract
Traditional robotic systems typically decompose intelligence into independent modules for computer vision, natural language processing, and motion control. Vision-Language-Action (VLA) models fundamentally transform this approach by employing a single neural network that can simultaneously process visual observations, understand human instructions, and directly output robot actions, all within a unified framework. However, these systems are highly dependent on high-quality training datasets that can capture the complex relationships between visual observations, language instructions, and robotic actions. This tutorial reviews three representative systems: the PyBullet simulation framework for flexible customized data generation, the LIBERO benchmark suite for standardized task definition and evaluation, and the RT-X dataset collection for large-scale multi-robot data acquisition. We…
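To make concrete the kind of record the abstract refers to (visual observation, language instruction, robot action), here is a minimal sketch of one possible episode container. It is not taken from the paper, and it does not use the PyBullet, LIBERO, or RT-X APIs. The names `Step`, `Episode`, and `collect_episode`, the 4x4 placeholder image, and the 7-value action layout (six pose deltas plus a gripper value) are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Step:
    """One timestep: a camera frame and the action taken after seeing it."""
    image: List[List[float]]  # placeholder for a rendered frame (H x W)
    action: List[float]       # assumed layout: 6 pose deltas + 1 gripper value


@dataclass
class Episode:
    """One demonstration: a single instruction shared by all timesteps."""
    instruction: str
    steps: List[Step] = field(default_factory=list)

    def add(self, image: List[List[float]], action: List[float]) -> None:
        self.steps.append(Step(image, list(action)))

    def to_training_triples(self) -> List[Tuple[List[List[float]], str, List[float]]]:
        """Return (image, instruction, action) triples, one per timestep."""
        return [(s.image, self.instruction, s.action) for s in self.steps]


def collect_episode(instruction: str, horizon: int = 3) -> Episode:
    """Fill an episode with dummy data; a simulator would supply real frames."""
    episode = Episode(instruction)
    for t in range(horizon):
        image = [[0.0] * 4 for _ in range(4)]  # stand-in for a rendered image
        action = [0.01 * t] * 6 + [1.0]        # hypothetical scripted action
        episode.add(image, action)
    return episode


episode = collect_episode("pick up the red block", horizon=3)
triples = episode.to_training_triples()
```

In a real pipeline the dummy image and scripted action would come from a simulator or a recorded robot trajectory; the point of the sketch is only that every training example pairs a frame and an action with the episode's instruction.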
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques
