Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation

Weimin Bai; Yubo Li; Weijian Luo; Zeqiang Lai; Yequan Wang; Wenzheng Chen; He Sun

arXiv:2511.14271 · cs.CV · November 19, 2025

TL;DR

VLM3D uses large vision-language models as semantic and spatial critics for text-to-3D generation, targeting two weaknesses of current models: missed fine-grained prompt details and poor spatial consistency.

Contribution

VLM3D introduces a dual-query critic signal, derived from a VLM's Yes/No log-odds, that evaluates both semantic fidelity and geometric coherence and applies to both optimization-based and feed-forward 3D generation methods.

Findings

01. Outperforms existing methods on standard benchmarks.

02. Effectively corrects spatial errors during 3D generation.

03. Enhances semantic fidelity in generated 3D models.

Abstract

Text-to-3D generation has advanced rapidly, yet state-of-the-art models, encompassing both optimization-based and feed-forward architectures, still face two fundamental limitations. First, they struggle with coarse semantic alignment, often failing to capture fine-grained prompt details. Second, they lack robust 3D spatial understanding, leading to geometric inconsistencies and catastrophic failures in part assembly and spatial relationships. To address these challenges, we propose VLM3D, a general framework that repurposes large vision-language models (VLMs) as powerful, differentiable semantic and spatial critics. Our core contribution is a dual-query critic signal derived from the VLM's Yes or No log-odds, which assesses both semantic fidelity and geometric coherence. We demonstrate the generality of this guidance signal across two distinct paradigms: (1) As a reward objective for…
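The abstract describes the critic signal only as "Yes or No log-odds" from two queries, so the following is a minimal sketch of how such a score could be computed, not the authors' implementation. It assumes the VLM exposes next-token logits for each query, and that the semantic and spatial log-odds are combined as a weighted sum. The function names, the token indices `yes_id` and `no_id`, and the weights `w_sem` and `w_spa` are all hypothetical.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over a vector of next-token logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def yes_no_log_odds(logits: Sequence[float], yes_id: int, no_id: int) -> float:
    """log p(Yes) - log p(No) for one query.

    The softmax normalizer cancels, so this equals logits[yes_id] - logits[no_id].
    """
    logp = log_softmax(logits)
    return logp[yes_id] - logp[no_id]


def dual_query_score(
    semantic_logits: Sequence[float],
    spatial_logits: Sequence[float],
    yes_id: int,
    no_id: int,
    w_sem: float = 1.0,  # assumed weight, not taken from the paper
    w_spa: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Combine the two critics as a weighted sum (an assumed combination rule)."""
    sem = yes_no_log_odds(semantic_logits, yes_id, no_id)
    spa = yes_no_log_odds(spatial_logits, yes_id, no_id)
    return w_sem * sem + w_spa * spa


# Toy 3-token vocabulary: index 0 = "Yes", index 1 = "No", index 2 = other.
semantic_logits = [2.0, 0.5, -1.0]  # VLM leans "Yes" on the semantic query
spatial_logits = [0.0, 1.0, 3.0]    # VLM leans "No" on the spatial query
score = dual_query_score(semantic_logits, spatial_logits, yes_id=0, no_id=1)
```

Because the log-odds reduce to a difference of two logits, the score is differentiable with respect to whatever produced those logits. Using it as a training or optimization signal would mean computing the same quantity inside an autodiff framework with gradients flowing back through the rendered views; this pure-Python version only illustrates the arithmetic.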

Taxonomy

Topics: 3D Shape Modeling and Analysis · Human Motion and Animation · Multimodal Machine Learning Applications