RoboVQA: Multimodal Long-Horizon Reasoning for Robotics
Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman

TL;DR
RoboVQA introduces a scalable data collection method and a diverse dataset for high-level robotic visual question answering, enabling improved reasoning and task performance in realistic settings with human oversight.
Contribution
The paper presents a novel scalable data collection scheme, a large diverse dataset, and a video-conditioned model for robotic reasoning, advancing beyond prior narrow or limited approaches.
Findings
Models trained on all embodiments outperform robot-only trained models.
Combining human and robot data is cost-effective and improves performance.
Video-conditioned VLMs outperform single-image models, with a 19% reduction in error rate.
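
As a rough illustration of what a 19% error reduction means, the sketch below computes a relative error reduction between two models. It assumes the reported figure is a relative reduction, which this summary does not state. The function name and both error rates are hypothetical, chosen only to make the arithmetic concrete, and are not results from the paper.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Return the fraction by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Hypothetical error rates (NOT taken from the paper), for illustration only.
single_image_error = 0.50  # assumed error rate of a single-image model
video_error = 0.405        # assumed error rate of a video-conditioned model

reduction = relative_error_reduction(single_image_error, video_error)
print(f"Relative error reduction: {reduction:.0%}")
```

Under this reading, the reduction is measured against the baseline's error rate, so the same absolute gap counts for more when the baseline error is already low.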
Abstract
We present a scalable, bottom-up and intrinsically diverse data collection scheme that can be used for high-level reasoning with long and medium horizons and that has 2.2x higher throughput compared to traditional narrow top-down step-by-step collection. We collect realistic data by performing any user requests within the entirety of 3 office buildings and using multiple robot and human embodiments. With this data, we show that models trained on all embodiments perform better than ones trained on the robot data only, even when evaluated solely on robot episodes. We find that for a fixed collection budget it is beneficial to take advantage of cheaper human collection along with robot collection. We release a large and highly diverse (29,520 unique instructions) dataset dubbed RoboVQA containing 829,502 (video, text) pairs for robotics-focused visual question answering. We also…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
