IR3D-Bench: Evaluating Vision-Language Model Scene Understanding as Agentic Inverse Rendering

Parker Liu; Chenxin Li; Zhengxin Li; Yipeng Wu; Wuyang Li; Zhiqin Yang; Zhenyuan Zhang; Yunlong Lin; Sirui Han; Brandon Y. Feng

arXiv:2506.23329·cs.CV·July 1, 2025

TL;DR

IR3D-Bench is a benchmark that asks vision-language models to demonstrate scene understanding by recreating the 3D structure of an input image with programming and rendering tools, emphasizing tool use and geometric accuracy over passive recognition.

Contribution

The paper presents IR3D-Bench, a new benchmark for evaluating vision-language models' ability to perform agentic inverse rendering using tools, moving beyond traditional descriptive tasks.

Findings

01. Current VLMs show limitations in visual precision during inverse rendering.
02. The benchmark reveals gaps in tool-using capabilities of state-of-the-art models.
03. Metrics assess geometric accuracy, spatial relations, and plausibility.

Abstract

Vision-language models (VLMs) excel at descriptive tasks, but whether they truly understand scenes from visual observations remains uncertain. We introduce IR3D-Bench, a benchmark challenging VLMs to demonstrate understanding through active creation rather than passive recognition. Grounded in the analysis-by-synthesis paradigm, IR3D-Bench tasks Vision-Language Agents (VLAs) with actively using programming and rendering tools to recreate the underlying 3D structure of an input image, achieving agentic inverse rendering through tool use. This "understanding-by-creating" approach probes the tool-using generative capacity of VLAs, moving beyond the descriptive or conversational capacity measured by traditional scene understanding benchmarks. We provide a comprehensive suite of metrics to evaluate geometric accuracy, spatial relations, appearance attributes, and overall plausibility.…
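To make the idea of scoring a recreated scene concrete, the sketch below compares a predicted object layout against a ground-truth one. It is a minimal illustration only, not the paper's metric definitions: the scene format, the function names (`match_objects`, `mean_position_error`, `left_of_agreement`), and the choice of a brute-force matching are all assumptions made for this example.

```python
from itertools import permutations
from math import dist

# Hypothetical scene format (an assumption): each object is (shape, (x, y, z)).
ground_truth = [("cube", (0.0, 0.0, 0.0)), ("sphere", (2.0, 0.0, 0.0))]
predicted = [("sphere", (2.0, 1.0, 0.0)), ("cube", (0.0, 0.0, 0.0))]


def match_objects(pred, gt):
    """Pair predicted and ground-truth objects so total distance is minimal.

    Brute force over permutations, so only suitable for small scenes.
    Assumes both scenes contain the same number of objects.
    """
    best = min(
        permutations(range(len(gt))),
        key=lambda order: sum(dist(p[1], gt[j][1]) for p, j in zip(pred, order)),
    )
    return list(enumerate(best))  # list of (pred_index, gt_index)


def mean_position_error(pred, gt):
    """Average Euclidean distance between matched object positions."""
    pairs = match_objects(pred, gt)
    return sum(dist(pred[i][1], gt[j][1]) for i, j in pairs) / len(pairs)


def left_of_agreement(pred, gt):
    """Fraction of ordered object pairs whose left/right order (x axis) matches."""
    pairs = match_objects(pred, gt)
    checks = [
        (pred[i1][1][0] < pred[i2][1][0]) == (gt[j1][1][0] < gt[j2][1][0])
        for (i1, j1) in pairs
        for (i2, j2) in pairs
        if i1 != i2
    ]
    # A scene with fewer than two objects has no relations to disagree on.
    return sum(checks) / len(checks) if checks else 1.0


print(mean_position_error(predicted, ground_truth))
print(left_of_agreement(predicted, ground_truth))
```

In this toy case the predicted sphere sits one unit off along y, so the position error is nonzero while the left/right relation between the two objects is still preserved. This shows why geometric accuracy and spatial relations can be worth scoring separately.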

Code & Models

Datasets

Piang/IR3D-bench (dataset, 90 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Generative Adversarial Networks and Image Synthesis