Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets
Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, Ken Goldberg

TL;DR
Robo2VLM introduces a large-scale VQA dataset derived from real robot manipulation data, enabling the evaluation and enhancement of vision-language models in understanding complex robotic scenes and tasks.
Contribution
The paper presents Robo2VLM, a novel framework for generating VQA datasets from robot trajectories, bridging robotic manipulation data with vision-language model benchmarking.
Findings
Robo2VLM-1 contains 684,710 questions across 463 scenes.
The dataset covers 3,396 manipulation tasks from 176k trajectories.
Results suggest that Robo2VLM-1 can both benchmark and improve VLM capabilities in spatial and interaction reasoning.
Abstract
Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm: using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of…
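To make the phase-segmentation idea in the abstract concrete, here is a minimal Python sketch that splits a trajectory into contiguous phases using only the gripper aperture signal. It is an illustration under stated assumptions, not the paper's algorithm: the phase labels ("open", "closing", "closed", "opening"), the thresholds `closed_thresh` and `eps`, and the names `Phase`, `label_step`, and `segment_phases` are all hypothetical, and the end-effector pose and force signals mentioned in the abstract are ignored here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    label: str   # hypothetical phase name, not taken from the paper
    start: int   # first timestep index of the phase (inclusive)
    end: int     # last timestep index of the phase (inclusive)


def label_step(aperture: float, prev_aperture: float,
               closed_thresh: float = 0.2, eps: float = 0.01) -> str:
    """Label one timestep from normalized gripper aperture (0 = shut, 1 = fully open).

    closed_thresh and eps are assumed values chosen for illustration.
    """
    delta = aperture - prev_aperture
    if delta < -eps:
        return "closing"
    if delta > eps:
        return "opening"
    # Aperture is steady: decide between a closed and an open gripper.
    return "closed" if aperture <= closed_thresh else "open"


def segment_phases(apertures: List[float]) -> List[Phase]:
    """Group consecutive timesteps that share a label into phases."""
    if not apertures:
        return []
    # The first timestep has no predecessor, so it is compared with itself.
    labels = [label_step(apertures[0], apertures[0])]
    for i in range(1, len(apertures)):
        labels.append(label_step(apertures[i], apertures[i - 1]))

    phases: List[Phase] = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            phases.append(Phase(labels[start], start, i - 1))
            start = i
    return phases


if __name__ == "__main__":
    trajectory = [1.0, 1.0, 0.6, 0.2, 0.2, 0.2, 0.6, 1.0]
    for phase in segment_phases(trajectory):
        print(phase)
```

In a fuller pipeline, each resulting phase could be paired with the camera frames in its index range, so that a question about the interaction state is answered from sensor-derived labels rather than from manual annotation.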
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning
