RoboVQA: Multimodal Long-Horizon Reasoning for Robotics

Pierre Sermanet; Tianli Ding; Jeffrey Zhao; Fei Xia; Debidatta Dwibedi; Keerthana Gopalakrishnan; Christine Chan; Gabriel Dulac-Arnold; Sharath Maddineni; Nikhil J Joshi; Pete Florence; Wei Han; Robert Baruch; Yao Lu; Suvir Mirchandani; Peng Xu; Pannag Sanketi; Karol Hausman; Izhak Shafran; Brian Ichter; Yuan Cao

arXiv:2311.00899 · cs.RO · November 3, 2023 · 2 cites



Open Access

TL;DR

RoboVQA introduces a scalable data collection method and a diverse dataset for high-level robotic visual question answering, enabling improved reasoning and task performance in realistic settings with human oversight.

Contribution

The paper presents a novel scalable data collection scheme, a large diverse dataset, and a video-conditioned model for robotic reasoning, advancing beyond prior narrow or limited approaches.

Findings

01. Models trained on all embodiments outperform models trained on robot data only, even when evaluated solely on robot episodes.

02. For a fixed collection budget, combining cheaper human data with robot data improves performance.

03. Video VLMs outperform single-image models, reducing the error rate by 19%.
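
To make the 19% figure concrete, here is a minimal sketch of how an error reduction of this kind can be computed. It assumes the figure is a relative reduction, which the summary does not state explicitly. The function name and both error rates are hypothetical, chosen only to illustrate the arithmetic; they are not results from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction of the baseline error removed by the new model."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (NOT from the paper), used only for illustration.
single_image_error = 0.100
video_error = 0.081

reduction = relative_error_reduction(single_image_error, video_error)
print(f"Relative error reduction: {reduction:.0%}")
```

Under this reading, a 19% reduction means the video model makes about one fifth fewer errors than the single-image model, whatever the absolute error rates are.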

Abstract

We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also…


Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning