GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation
Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, and Matei Zaharia

TL;DR
GRAID introduces a novel method for generating high-quality spatial reasoning datasets for vision-language models using only 2D geometric primitives, significantly improving model understanding and generalization.
Contribution
GRAID leverages 2D bounding boxes to create reliable spatial reasoning datasets, avoiding 3D reconstruction errors and hallucinations, and demonstrates improved model performance on multiple benchmarks.
Findings
GRAID datasets achieve 91.16% human-validated accuracy.
Models trained on GRAID data show 47.5% and 37.9% accuracy improvements.
Enhanced spatial reasoning generalization across multiple question types.
Abstract
Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of…
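To make the abstract's key insight concrete, here is a minimal sketch of how a qualitative left/right relation and a templated question could be derived from two 2D bounding boxes. This is not GRAID's actual algorithm: the `Box` type, the strict non-overlap rule, and the question template are all hypothetical choices made for illustration. Boxes are assumed to be in image coordinates, with x increasing to the right.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical 2D detection: class label plus pixel-space corners."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def horizontal_relation(a: Box, b: Box) -> Optional[str]:
    """Return 'left' or 'right' for a relative to b, or None if ambiguous.

    Assumption: a relation is emitted only when the horizontal extents do
    not overlap, so ambiguous pairs are skipped instead of guessed.
    """
    if a.x_max < b.x_min:
        return "left"
    if b.x_max < a.x_min:
        return "right"
    return None


def make_question(a: Box, b: Box) -> Optional[Tuple[str, str]]:
    """Fill a hypothetical yes/no template; return None for ambiguous pairs."""
    relation = horizontal_relation(a, b)
    if relation is None:
        return None
    question = f"Is the {a.label} to the left of the {b.label}?"
    answer = "yes" if relation == "left" else "no"
    return question, answer


if __name__ == "__main__":
    car = Box("car", 10, 50, 60, 90)
    pedestrian = Box("pedestrian", 100, 40, 120, 95)
    print(make_question(car, pedestrian))
```

Skipping overlapping pairs trades dataset size for label reliability. Note that a rule like this reasons only in the image plane, so it does not address the perspective concern raised in the reviews.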
Peer Reviews
Decision·Submitted to ICLR 2026
I believe improving the spatial reasoning abilities of VLMs is important. I am surprised that the pipeline described in this paper has not been proposed before (if true). Overall, I believe leveraging 2D models for this purpose is technically sound. Their experimental results show that there is some cross-type transfer (e.g., training on 6 question types improves >10 held-out types), and that the data also boosts public benchmarks such as BLINK and A-OKVQA. I also appreciate the human validation results.
I was surprised to find that such a pipeline has not been studied before, or perhaps I am just not familiar with the relevant literature. I will double-check the related work and confer with other reviewers on this point. The proposed template-based tasks definitely limit the expressiveness and diversity of the datasets. Even 91% is not desirable in my mind for dataset quality, and in particular, the spatial reasoning tasks shown in the paper are not that challenging. Extending t
This dataset is very large and seems to be high-quality. If the authors open-source it, it would be a great help to the community. It's also impressive that even though the data is only from the driving domain, it helps improve performance on general tasks.
1. The method in Algorithm 1 naively uses 2D bounding box alignment to infer "left/right" relationships, ignoring perspective. This is likely to introduce significant label noise in driving scenes by misinterpreting 3D "front/back" configurations as 2D "left/right" ones, leading to dataset inaccuracies. 2. I'm concerned about whether a detector like YOLO can actually tell apart different objects of the same type. For example, can it handle several cars that look almost identical? This must happe
1. The paper proposes a simple yet effective framework to generate high-quality data from only 2D bounding boxes. Although I don't find the data generation pipeline itself to be novel, it does solve the core problems of low data quality of the previous spatial VQA datasets in a simple and intuitive way. 2. The experiments conducted are very sound and support the claim. I'm especially impressed by the human studies showing the flaws of previous VQA datasets, and the 95% accuracy of the proposed d
1. My main concern with the proposed pipeline is that it is only evaluated on autonomous driving datasets. The authors claim in section 3.1 that GRAID can also work on detection-model-generated bounding boxes, but it is unclear how much the data quality will degrade when switching from GT detections to model detections. Therefore, I'm concerned about the generalization of the proposed method beyond autonomous driving scenes. One possible experiment the authors can do is: similar to L339-389, the
