Box-QAymo: Box-Referring VQA Dataset for Autonomous Driving

Djamahl Etchegaray; Yuxia Fu; Zi Huang; Yadan Luo

arXiv:2507.00525 · cs.CV · July 2, 2025

TL;DR

Box-QAymo introduces a novel dataset and benchmark for evaluating vision-language models in autonomous driving, focusing on localized, user-driven queries involving spatial and temporal reasoning about objects.

Contribution

The paper presents a new box-referring dataset and hierarchical evaluation protocol tailored for autonomous driving scenarios, enabling assessment and fine-tuning of models on complex spatial-temporal reasoning tasks.

Findings

1. Current VLMs show significant limitations in perception-based questions.

2. The dataset captures diverse object classes and attributes relevant to driving.

3. Evaluation reveals gaps in models' ability to understand and reason about dynamic scenes.

Abstract

Interpretable communication is essential for safe and trustworthy autonomous driving, yet current vision-language models (VLMs) often operate under idealized assumptions and struggle to capture user intent in real-world scenarios. Existing driving-oriented VQA datasets are limited to full-scene descriptions or waypoint prediction, preventing the assessment of whether VLMs can respond to localized user-driven queries. We introduce Box-QAymo, a box-referring dataset and benchmark designed to both evaluate and finetune VLMs on spatial and temporal reasoning over user-specified objects. Users express intent by drawing bounding boxes, offering a fast and intuitive interface for focused queries in complex scenes. Specifically, we propose a hierarchical evaluation protocol that begins with binary sanity-check questions to assess basic model capacities, and progresses to (1) attribute…
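To make the idea of a box-referring query and a binary sanity-check stage more concrete, the following is a minimal Python sketch. It is not the dataset's actual schema or the authors' evaluation code: the `BoxQuery` record, its field names, the pixel `(x_min, y_min, x_max, y_max)` box convention, and the `binary_accuracy` helper are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class BoxQuery:
    """Hypothetical record for one box-referring question (assumed layout)."""
    image_id: str
    # Assumed convention: user-drawn box in pixels as (x_min, y_min, x_max, y_max).
    box_xyxy: Tuple[float, float, float, float]
    question: str
    answer: str  # ground truth; "yes" or "no" for a binary sanity check


def binary_accuracy(queries: Sequence[BoxQuery], predictions: Sequence[str]) -> float:
    """Fraction of predictions matching the ground truth, ignoring case and whitespace."""
    if len(queries) != len(predictions):
        raise ValueError("queries and predictions must have the same length")
    if not queries:
        return 0.0
    correct = sum(
        q.answer.strip().lower() == p.strip().lower()
        for q, p in zip(queries, predictions)
    )
    return correct / len(queries)


# Illustrative, made-up examples (not taken from the dataset).
example_queries = [
    BoxQuery("frame_0001", (120.0, 80.0, 260.0, 210.0), "Is the object in the box a vehicle?", "yes"),
    BoxQuery("frame_0002", (40.0, 60.0, 90.0, 180.0), "Is the object in the box a vehicle?", "no"),
]
example_predictions = ["Yes", "yes"]  # stand-in model outputs
example_score = binary_accuracy(example_queries, example_predictions)
```

In this sketch the model answers the first question correctly and the second incorrectly, so `example_score` is 0.5. A real evaluation would replace the stand-in predictions with VLM outputs and would follow the paper's own question hierarchy and metrics.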

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)