Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets

Kaiyuan Chen; Shuangyu Xie; Zehan Ma; Pannag R Sanketi; Ken Goldberg

arXiv:2505.15517 · cs.RO · June 23, 2025

TL;DR

Robo2VLM introduces a large-scale VQA dataset derived from real robot manipulation data, enabling the evaluation and enhancement of vision-language models in understanding complex robotic scenes and tasks.

Contribution

The paper presents Robo2VLM, a novel framework for generating VQA datasets from robot trajectories, bridging robotic manipulation data with vision-language model benchmarking.

Findings

1. Robo2VLM-1 contains 684,710 questions across 463 scenes.

2. The dataset covers 3,396 manipulation tasks from 176k trajectories.

3. Results suggest that Robo2VLM-1 can both benchmark and improve VLM capabilities in spatial and interaction reasoning.

Abstract

Vision-Language Models (VLMs) acquire real-world knowledge and general reasoning ability through Internet-scale image-text corpora. They can augment robotic systems with scene understanding and task planning, and assist visuomotor policies that are trained on robot trajectory data. We explore the reverse paradigm: using rich, real, multi-modal robot trajectory data to enhance and evaluate VLMs. In this paper, we present Robo2VLM, a Visual Question Answering (VQA) dataset generation framework for VLMs. Given a human tele-operated robot trajectory, Robo2VLM derives ground truth from non-visual and non-descriptive sensory modalities, such as end-effector pose, gripper aperture, and force sensing. Based on these modalities, it segments the robot trajectory into a sequence of manipulation phases. At each phase, Robo2VLM uses scene and interaction understanding to identify 3D properties of…
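The phase-segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it uses only one of the named modalities (gripper aperture), and the normalized aperture scale, the `closed_below` threshold, the `Phase` type, and the "open"/"closed" labels are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Phase:
    label: str  # hypothetical label: "open" or "closed"
    start: int  # index of the first timestep in the phase
    end: int    # index of the last timestep in the phase (inclusive)


def segment_by_aperture(aperture: List[float], closed_below: float = 0.5) -> List[Phase]:
    """Split a trajectory into contiguous runs of open/closed gripper state.

    Assumes aperture is normalized so that 1.0 is fully open and 0.0 is
    fully closed; the 0.5 threshold is an illustrative choice.
    """
    phases: List[Phase] = []
    for t, a in enumerate(aperture):
        label = "closed" if a < closed_below else "open"
        if phases and phases[-1].label == label:
            # Same gripper state as the previous timestep: extend the run.
            phases[-1].end = t
        else:
            # Gripper state changed (or first timestep): start a new phase.
            phases.append(Phase(label, t, t))
    return phases


# Toy aperture trace: gripper open, then closed on an object, then open again.
aperture = [1.0, 0.9, 0.8, 0.2, 0.1, 0.1, 0.7, 1.0]
phases = segment_by_aperture(aperture)
for p in phases:
    print(p.label, p.start, p.end)
```

A fuller version would presumably combine aperture with end-effector pose and force readings before assigning phase labels; thresholding a single signal is only the simplest case.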

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Domain Adaptation and Few-Shot Learning