Improving Viewpoint Consistency in 3D Generation via Structure Feature and CLIP Guidance

Qing Zhang; Jinguang Tong; Jing Zhang; Jie Hong; Xuesong Li

arXiv:2412.02287 · cs.CV · August 15, 2025

Open Access

TL;DR

This paper introduces a tuning-free method called ACG that improves viewpoint consistency in text-to-3D generation by controlling attention, filtering viewpoints with CLIP, and refining through staged prompts, effectively reducing the Janus Problem.

Contribution

The paper proposes a novel, tuning-free ACG mechanism that enhances viewpoint accuracy in 3D generation without additional training, addressing the Janus Problem.

Findings

01 Significantly reduces the Janus Problem in 3D generation.

02 Maintains high generation speed while improving viewpoint consistency.

03 Serves as an efficient plug-and-play component for existing frameworks.

Abstract

Despite recent advances in text-to-3D generation techniques, current methods often suffer from geometric inconsistencies, commonly referred to as the Janus Problem. This paper identifies the root cause of the Janus Problem: viewpoint generation bias in diffusion models, which creates a significant gap between the actual generated viewpoint and the expected one required for optimizing the 3D model. To address this issue, we propose a tuning-free approach called the Attention and CLIP Guidance (ACG) mechanism. ACG enhances desired viewpoints by adaptively controlling cross-attention maps, employs CLIP-based view-text similarities to filter out erroneous viewpoints, and uses a coarse-to-fine optimization strategy with staged prompts to progressively refine 3D generation. Extensive experiments demonstrate that our method significantly reduces the Janus Problem without compromising…
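The CLIP-based filtering idea in the abstract (using view-text similarities to reject renders whose apparent viewpoint disagrees with the expected one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy 3-dimensional embeddings, the temperature value, and the argmax acceptance rule are all assumptions made for illustration. In practice the embeddings would come from a CLIP image encoder and text encoder applied to the render and to view prompts such as "front view" or "back view".

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def view_matches(image_emb, view_text_embs, expected_view, temperature=0.01):
    """Hypothetical filter: keep a render only if the most similar
    view prompt is the expected one.

    image_emb      -- embedding of the rendered or denoised image (assumed)
    view_text_embs -- dict mapping a view name to its text embedding (assumed)
    expected_view  -- view name implied by the camera pose
    """
    views = list(view_text_embs)
    sims = [cosine(image_emb, view_text_embs[v]) for v in views]
    probs = dict(zip(views, softmax(sims, temperature)))
    predicted = max(probs, key=probs.get)
    return predicted == expected_view, probs


# Toy stand-ins for CLIP embeddings (assumption, not real CLIP outputs).
text_embs = {
    "front": [1.0, 0.0, 0.0],
    "side": [0.0, 1.0, 0.0],
    "back": [0.0, 0.0, 1.0],
}
render_emb = [0.9, 0.1, 0.2]  # looks like a front view

# The camera expects a back view, but the render resembles a front view,
# so this sample would be filtered out under the assumed rule.
keep, probs = view_matches(render_emb, text_embs, expected_view="back")
```

A render flagged this way corresponds to the viewpoint gap the abstract describes: the diffusion model produced a front-like image when a back view was needed. How the paper actually uses the similarities (thresholds, weighting, or rejection) is not stated in the text above.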

Taxonomy

Topics: Robotics and Sensor-Based Localization · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques

Methods: Softmax · Attention Is All You Need · Diffusion · Contrastive Language-Image Pre-training