REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
Agneet Chatterjee, Yiran Luo, Tejas Gokhale, Yezhou Yang, Chitta Baral

TL;DR
This paper introduces REVISION, a 3D rendering-based framework that generates spatially accurate synthetic images from text prompts. Used as additional guidance, these images improve the spatial fidelity of vision-language models and their spatial reasoning in multimodal tasks.
Contribution
The paper presents REVISION, a novel 3D rendering pipeline that improves spatial fidelity in vision-language models and introduces RevQA, a benchmark for spatial reasoning evaluation.
Findings
REVISION improves spatial consistency across models.
Synthetic images from REVISION, used as training-free guidance, enhance spatial reasoning.
State-of-the-art models show limited robustness in complex spatial reasoning.
Abstract
Text-to-Image (T2I) and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, it has been found that such vision-language models lack the ability to correctly reason over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework that currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR…
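To make the idea of turning a textual prompt into a spatially accurate scene more concrete, the following is a minimal sketch, not the authors' implementation. It parses a two-object prompt and assigns 3D positions that satisfy the stated relationship by construction, which a renderer could then consume. The relation phrases, the axis convention (x = right, y = up, z = toward the camera), the `spacing` parameter, and the names `RELATION_OFFSETS`, `Placement`, `parse_prompt`, and `layout_scene` are all assumptions made for illustration. Only six relations are shown; the abstract does not list the 11 that REVISION supports.

```python
import re
from dataclasses import dataclass

# Hypothetical subset of spatial relations, each mapped to a unit offset of the
# subject relative to the reference object. Axis convention (assumed):
# x = right, y = up, z = toward the camera.
RELATION_OFFSETS = {
    "to the left of": (-1.0, 0.0, 0.0),
    "to the right of": (1.0, 0.0, 0.0),
    "above": (0.0, 1.0, 0.0),
    "below": (0.0, -1.0, 0.0),
    "in front of": (0.0, 0.0, 1.0),
    "behind": (0.0, 0.0, -1.0),
}


@dataclass
class Placement:
    asset: str        # name of the 3D asset to load
    position: tuple   # (x, y, z) in scene units


def parse_prompt(prompt):
    """Split '<subject> <relation> <reference>' into its three parts."""
    text = prompt.strip().lower()
    for phrase in RELATION_OFFSETS:
        pattern = rf"(?:an?\s+)?(.+?)\s+{re.escape(phrase)}\s+(?:an?\s+)?(.+)"
        match = re.fullmatch(pattern, text)
        if match:
            return match.group(1), phrase, match.group(2)
    raise ValueError(f"no supported spatial relation in prompt: {prompt!r}")


def layout_scene(prompt, spacing=2.0):
    """Place the reference at the origin and the subject at a scaled offset."""
    subject, relation, reference = parse_prompt(prompt)
    dx, dy, dz = RELATION_OFFSETS[relation]
    return [
        Placement(reference, (0.0, 0.0, 0.0)),
        Placement(subject, (dx * spacing, dy * spacing, dz * spacing)),
    ]


if __name__ == "__main__":
    for placement in layout_scene("a cat to the left of a dog"):
        print(placement)
```

Because positions are computed deterministically from the relation, the resulting layout is correct by construction. This is the property that makes rendered images usable as spatial guidance, in contrast to images sampled from a generative model.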
Taxonomy
Topics: Constraint Satisfaction and Optimization · Geographic Information Systems Studies · Semantic Web and Ontologies
