Vision-as-Inverse-Graphics Agent via Interleaved Multimodal Reasoning

Shaofeng Yin; Jiaxin Ge; Zora Zhiruo Wang; Chenyang Wang; Xiuyu Li; Michael J. Black; Trevor Darrell; Angjoo Kanazawa; Haiwen Feng

arXiv:2601.11109 · cs.CV · April 7, 2026

TL;DR

VIGA is a training-free, multimodal reasoning framework that reconstructs images into editable programs through an iterative code-render-inspect loop, enhancing accuracy across diverse visual tasks.

Contribution

Introduces VIGA, a training-free framework for vision-as-inverse-graphics based on interleaved multimodal reasoning, together with a new benchmark, BlenderBench.

Findings

01

VIGA improves accuracy significantly over one-shot baselines.

02

VIGA supports diverse tasks like 2D document generation and 3D reconstruction.

03

Empirical results show substantial accuracy gains on multiple benchmarks.

Abstract

Vision-as-inverse-graphics, the concept of reconstructing images into editable programs, remains challenging for Vision-Language Models (VLMs), which inherently lack fine-grained spatial grounding in one-shot settings. To address this, we introduce VIGA (Vision-as-Inverse-Graphics Agent), an interleaved multimodal reasoning framework where symbolic logic and visual perception actively cross-verify each other. VIGA operates through a tightly coupled code-render-inspect loop: synthesizing symbolic programs, projecting them into visual states, and inspecting discrepancies to guide iterative edits. Equipped with high-level semantic skills and an evolving multimodal memory, VIGA sustains evidence-based modifications over long horizons. This training-free, task-agnostic framework seamlessly supports 2D document generation, 3D reconstruction, multi-step 3D editing, and 4D physical interaction.…
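The code-render-inspect loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "program" is a single square on a small grid, the renderer is a plain rasterizer, and the inspector compares bounding boxes where VIGA would use a VLM. All names (`SquareProgram`, `render`, `inspect`, `edit`, `code_render_inspect`) are hypothetical, and the sketch assumes the square stays fully on the canvas.

```python
from dataclasses import dataclass, replace

Image = list[list[int]]


@dataclass(frozen=True)
class SquareProgram:
    """Hypothetical symbolic program: one filled square on a canvas."""
    x: int
    y: int
    size: int


def render(program: SquareProgram, width: int = 8, height: int = 8) -> Image:
    """Project the symbolic program into a visual state (a binary raster)."""
    return [
        [
            1 if (program.x <= col < program.x + program.size
                  and program.y <= row < program.y + program.size) else 0
            for col in range(width)
        ]
        for row in range(height)
    ]


def bounding_box(image: Image) -> tuple[int, int, int]:
    """Return (left column, top row, width) of the filled region."""
    cols = [c for row in image for c, value in enumerate(row) if value]
    rows = [r for r, row in enumerate(image) if any(row)]
    if not cols:
        raise ValueError("nothing was rendered on the canvas")
    return min(cols), min(rows), max(cols) - min(cols) + 1


def inspect(rendered: Image, target: Image) -> dict[str, int]:
    """Stand-in for visual inspection: signed gaps between render and target."""
    rx, ry, rs = bounding_box(rendered)
    tx, ty, ts = bounding_box(target)
    return {"x": tx - rx, "y": ty - ry, "size": ts - rs}


def sign(value: int) -> int:
    return (value > 0) - (value < 0)


def edit(program: SquareProgram, discrepancy: dict[str, int]) -> SquareProgram:
    """Apply one small edit per parameter, in the direction of the discrepancy."""
    return replace(
        program,
        x=program.x + sign(discrepancy["x"]),
        y=program.y + sign(discrepancy["y"]),
        size=program.size + sign(discrepancy["size"]),
    )


def code_render_inspect(target: Image, initial: SquareProgram, max_steps: int = 20):
    """Iterate code -> render -> inspect -> edit, logging each step in memory."""
    memory = []  # history of (program, discrepancy) pairs
    program = initial
    for _ in range(max_steps):
        discrepancy = inspect(render(program), target)
        memory.append((program, discrepancy))
        if not any(discrepancy.values()):
            break  # render matches the target
        program = edit(program, discrepancy)
    return program, memory


target = render(SquareProgram(x=3, y=2, size=3))
final, memory = code_render_inspect(target, SquareProgram(x=0, y=0, size=1))
print(final)  # SquareProgram(x=3, y=2, size=3)
```

The `memory` list is a simplified analogue of the evolving memory mentioned in the abstract: it records each program together with the discrepancy observed for it, so later edits could in principle consult earlier evidence.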

Code & Models

Datasets

DietCoke4671/BlenderBench
dataset · 286 downloads