Embodied Scene Understanding for Vision Language Models via MetaVQA

Weizhen Wang; Chenda Duan; Zhenghao Peng; Yuxin Liu; Bolei Zhou

arXiv:2501.09167 · cs.CV · January 17, 2025

TL;DR

MetaVQA is a new benchmark for evaluating and improving vision language models' spatial reasoning and scene understanding in embodied AI tasks, especially for autonomous driving scenarios.

Contribution

We introduce MetaVQA, a benchmark of question-answer pairs generated automatically from real-world traffic data (nuScenes and Waymo), designed to both assess and improve VLMs' spatial reasoning and scene understanding.

Findings

01. Fine-tuning VLMs on MetaVQA improves their spatial reasoning.

02. Fine-tuned models achieve higher VQA accuracy and exhibit safety-aware driving maneuvers.

03. The learned capabilities transfer strongly from simulation to real-world observations.

Abstract

Vision Language Models (VLMs) demonstrate significant potential as embodied AI agents for various mobility applications. However, a standardized, closed-loop benchmark for evaluating their spatial reasoning and sequential decision-making capabilities is lacking. To address this, we present MetaVQA: a comprehensive benchmark designed to assess and enhance VLMs' understanding of spatial relationships and scene dynamics through Visual Question Answering (VQA) and closed-loop simulations. MetaVQA leverages Set-of-Mark prompting and top-down view ground-truth annotations from nuScenes and Waymo datasets to automatically generate extensive question-answer pairs based on diverse real-world traffic scenarios, ensuring object-centric and context-rich instructions. Our experiments show that fine-tuning VLMs with the MetaVQA dataset significantly improves their spatial reasoning and embodied scene…
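The automatic question generation described in the abstract can be illustrated with a minimal sketch. It assumes each annotated object carries a numeric mark (in the spirit of Set-of-Mark prompting) and a top-down position in the ego vehicle's frame. The `MarkedObject` type, the axis convention, the 30-degree sector boundaries, and the question template are all hypothetical choices made for illustration; they are not taken from the MetaVQA pipeline.

```python
import math
from dataclasses import dataclass

# Hypothetical answer vocabulary; the real benchmark's options may differ.
OPTIONS = ["front", "left", "back", "right"]


@dataclass
class MarkedObject:
    label: int      # numeric mark overlaid on the image (assumed Set-of-Mark style)
    category: str   # e.g. "car", "pedestrian"
    x: float        # metres ahead (+) or behind (-) the ego vehicle (assumed convention)
    y: float        # metres to the left (+) or right (-) of the ego vehicle


def sector(obj: MarkedObject) -> str:
    """Map a top-down position to a coarse direction relative to the ego vehicle."""
    # 0 degrees points straight ahead, +90 degrees points to the left.
    angle = math.degrees(math.atan2(obj.y, obj.x))
    if abs(angle) <= 30:
        return "front"
    if 30 < angle < 150:
        return "left"
    if -150 < angle < -30:
        return "right"
    return "back"


def make_question(obj: MarkedObject) -> dict:
    """Build one multiple-choice spatial question from a ground-truth annotation."""
    letters = "ABCD"
    return {
        "question": f"Where is object <{obj.label}> relative to the ego vehicle?",
        "options": {letters[i]: opt for i, opt in enumerate(OPTIONS)},
        "answer": letters[OPTIONS.index(sector(obj))],
    }


if __name__ == "__main__":
    qa = make_question(MarkedObject(label=3, category="car", x=10.0, y=0.5))
    print(qa["question"], qa["options"], qa["answer"])
```

Because the answer is derived from ground-truth geometry rather than from human labelling, a generator of this kind can be run over every annotated object in a scene, which is what makes large-scale, object-centric question sets feasible.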

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Natural Language Processing Techniques