GRAID: Enhancing Spatial Reasoning of VLMs Through High-Fidelity Data Generation

Karim Elmaaroufi, Liheng Lai, Justin Svegliato, Yutong Bai, Sanjit A. Seshia, and Matei Zaharia

arXiv:2510.22118·cs.CV·October 29, 2025

TL;DR

GRAID introduces a novel method for generating high-quality spatial reasoning datasets for vision-language models using only 2D geometric primitives, significantly improving model understanding and generalization.

Contribution

GRAID leverages 2D bounding boxes to create reliable spatial reasoning datasets, avoiding 3D reconstruction errors and hallucinations, and demonstrates improved model performance on multiple benchmarks.
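
The idea of reading a qualitative relation directly off 2D boxes can be illustrated with a small sketch. This is not the paper's algorithm: the `Box` type, the `horizontal_relation` and `make_question` helpers, the `margin` parameter, and the question template are all hypothetical, and the sketch assumes image coordinates with x increasing to the right. A relation is emitted only when the two boxes do not overlap horizontally, so ambiguous pairs produce no question.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D bounding box in pixel coordinates (hypothetical type)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def horizontal_relation(a: Box, b: Box, margin: float = 0.0) -> Optional[str]:
    """Return how `a` relates to `b` horizontally, or None if ambiguous.

    A relation is reported only when the horizontal extents are disjoint
    by at least `margin` pixels (an assumed tolerance, not from the paper).
    """
    if a.x_max + margin <= b.x_min:
        return "left of"
    if b.x_max + margin <= a.x_min:
        return "right of"
    return None  # extents overlap: skip rather than guess


def make_question(a: Box, b: Box) -> Optional[Tuple[str, str]]:
    """Fill an illustrative yes/no template; returns None for ambiguous pairs."""
    relation = horizontal_relation(a, b)
    if relation is None:
        return None
    question = f"Is the {a.label} to the left of the {b.label}?"
    answer = "Yes" if relation == "left of" else "No"
    return question, answer


car = Box("car", 10, 50, 60, 90)
person = Box("person", 100, 40, 130, 120)
truck = Box("truck", 40, 30, 120, 100)  # overlaps both boxes horizontally

print(make_question(car, person))
print(make_question(person, car))
print(make_question(car, truck))
```

Skipping overlapping pairs trades coverage for label reliability; whether such a 2D test matches human judgment in scenes with strong perspective is a separate question that the sketch does not address.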

Findings

01. GRAID datasets achieve 91.16% human-validated accuracy.

02. Models trained on GRAID data show 47.5% and 37.9% accuracy improvements.

03. Enhanced spatial reasoning generalization across multiple question types.

Abstract

Vision Language Models (VLMs) achieve strong performance on many vision-language tasks but often struggle with spatial reasoning, a prerequisite for many applications. Empirically, we find that a dataset produced by a current training data generation pipeline has a 57.6% human validation rate. These rates stem from current limitations: single-image 3D reconstruction introduces cascading modeling errors and requires wide answer tolerances, while caption-based methods require hyper-detailed annotations and suffer from generative hallucinations. We present GRAID, built on the key insight that qualitative spatial relationships can be reliably determined from 2D geometric primitives alone. By operating exclusively on 2D bounding boxes from standard object detectors, GRAID avoids both 3D reconstruction errors and generative hallucinations, resulting in datasets that are of…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 6·Confidence 2

Strengths

I believe improving the spatial reasoning abilities of VLMs is important. I am surprised that the pipeline described in this paper has not been proposed before (if true). Overall, I believe leveraging 2D models for this purpose is technically sound. The experimental results show some cross-type transfer (e.g., training on 6 question types improves >10 held-out types) and gains on public benchmarks such as BLINK and A-OKVQA. I also appreciate the human validation results.

Weaknesses

I was surprised to find that such a pipeline has not been studied before, or perhaps I am just not familiar with the relevant literature. I will double-check the related work and confer with the other reviewers on this point. The proposed template-based tasks definitely limit the expressiveness and diversity of the datasets. Even 91% is not desirable in my mind for dataset quality, and in particular, the spatial reasoning tasks shown in the paper are not that challenging. Extending t

Reviewer 02·Rating 2·Confidence 4

Strengths

This dataset is very large and seems to be high-quality. If the authors open-source it, it would be a great help to the community. It's also impressive that even though the data is only from the driving domain, it helps improve performance on general tasks.

Weaknesses

1. The method in Algorithm 1 naively uses 2D bounding box alignment to infer "left/right" relationships, ignoring perspective. This is likely to introduce significant label noise in driving scenes by misinterpreting 3D "front/back" configurations as 2D "left/right" ones, leading to dataset inaccuracies.
2. I'm concerned about whether a detector like YOLO can actually tell apart different objects of the same type. For example, can it handle several cars that look almost identical? This must happe

Reviewer 03·Rating 8·Confidence 3

Strengths

1. The paper proposes a simple yet effective framework to generate high-quality data from only 2D bounding boxes. Although I don't find the data generation pipeline itself to be novel, it does solve the core problems of low data quality of the previous spatial VQA datasets in a simple and intuitive way.
2. The experiments conducted are very sound and support the claim. I'm especially impressed by the human studies showing the flaws of previous VQA datasets, and the 95% accuracy of the proposed d

Weaknesses

1. My main concern with the proposed pipeline is that it is only evaluated on autonomous driving datasets. The authors claim in Section 3.1 that GRAID can also work on detection-model-generated bounding boxes, but it is unclear how much the data quality will degrade when switching from GT detections to model detections. Therefore, I'm concerned about the generalization of the proposed method beyond autonomous driving scenes. One possible experiment the authors can do is: similar to L339-389, the
