REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models

Agneet Chatterjee; Yiran Luo; Tejas Gokhale; Yezhou Yang; Chitta Baral

arXiv:2408.02231·cs.CV·August 6, 2024

Open Access · 1 dataset

TL;DR

This paper introduces REVISION, a 3D rendering-based framework that generates spatially accurate synthetic images from text prompts. Used as additional guidance, these images improve the spatial fidelity of vision-language models.

Contribution

The paper presents REVISION, a novel 3D rendering pipeline that improves spatial fidelity in vision-language models and introduces RevQA, a benchmark for spatial reasoning evaluation.

Findings

01 REVISION improves spatial consistency across models.

02 Used as training-free guidance, synthetic images from REVISION improve the spatial consistency of T2I models.

03 State-of-the-art models show limited robustness in complex spatial reasoning.

Abstract

Text-to-Image (T2I) models and multimodal large language models (MLLMs) have been adopted in solutions for several computer vision and multimodal learning tasks. However, such vision-language models have been found to lack the ability to reason correctly over spatial relationships. To tackle this shortcoming, we develop the REVISION framework, which improves spatial fidelity in vision-language models. REVISION is a 3D rendering-based pipeline that generates spatially accurate synthetic images, given a textual prompt. REVISION is an extendable framework, which currently supports 100+ 3D assets and 11 spatial relationships, all with diverse camera perspectives and backgrounds. Leveraging images from REVISION as additional guidance in a training-free manner consistently improves the spatial consistency of T2I models across all spatial relationships, achieving competitive performance on the VISOR…
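To make the idea of turning a textual prompt into a spatially accurate layout more concrete, here is a minimal sketch of the first step such a pipeline might take: parsing a two-object prompt and assigning each object a 3D position. This is not the REVISION implementation. The relation table, the `Placement` type, the `parse_prompt` and `layout` functions, and the coordinate convention are all hypothetical, and the four relations shown are illustrative rather than the paper's list of 11.

```python
import re
from dataclasses import dataclass

# Hypothetical relation table (an assumption, not the paper's list).
# Each entry is a unit offset (x, y, z) for the first object relative
# to the second, with x pointing right and y pointing up.
RELATION_OFFSETS = {
    "to the left of": (-1.0, 0.0, 0.0),
    "to the right of": (1.0, 0.0, 0.0),
    "above": (0.0, 1.0, 0.0),
    "below": (0.0, -1.0, 0.0),
}


@dataclass
class Placement:
    """An object name and the 3D position it would be rendered at."""
    name: str
    position: tuple


def parse_prompt(prompt):
    """Split '<A> <relation> <B>' into (A, relation, B), dropping articles."""
    text = prompt.strip().lower()
    for relation in RELATION_OFFSETS:
        pattern = (
            rf"^(?:an?\s+)?(.+?)\s+{re.escape(relation)}\s+(?:an?\s+)?(.+)$"
        )
        match = re.match(pattern, text)
        if match:
            return match.group(1), relation, match.group(2)
    raise ValueError(f"no supported spatial relation in prompt: {prompt!r}")


def layout(prompt, distance=2.0):
    """Place the reference object at the origin and offset the subject."""
    subject, relation, reference = parse_prompt(prompt)
    dx, dy, dz = RELATION_OFFSETS[relation]
    return [
        Placement(subject, (dx * distance, dy * distance, dz * distance)),
        Placement(reference, (0.0, 0.0, 0.0)),
    ]


if __name__ == "__main__":
    for placement in layout("a dog to the left of a chair"):
        print(placement)
```

Because the positions come from a fixed table rather than from a learned model, the spatial relation holds by construction in this sketch. A real pipeline would still need to choose assets, camera perspectives, and backgrounds before rendering.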

Datasets

revision-t2i/revision-generator (dataset, 1.2k downloads)

Taxonomy

Topics: Constraint Satisfaction and Optimization · Geographic Information Systems Studies · Semantic Web and Ontologies