Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models

Jingzhi Bao; Xueting Li; Ming-Hsuan Yang

arXiv:2410.10821 · cs.CV · May 6, 2025

TL;DR

Tex4D is a novel zero-shot method that combines 3D geometry knowledge with video diffusion models to generate temporally and multi-view consistent 4D textures for mesh sequences from text prompts.

Contribution

It introduces the first approach specifically designed for 4D scene texturing, integrating UV space synchronization and a modified DDIM process for improved consistency.
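This summary does not say how the DDIM process is modified. For orientation only, here is a minimal sketch of the standard deterministic DDIM update (eta = 0) that such a modification would presumably start from. The function name `ddim_step` and its argument names are illustrative assumptions, not taken from the Tex4D code.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One standard deterministic (eta = 0) DDIM update to the previous timestep.

    This is the textbook DDIM rule, NOT the modified process used by Tex4D.
    x_t:            current noisy latent
    eps_pred:       noise predicted by the denoiser at this timestep
    alpha_bar_t:    cumulative noise-schedule product at the current timestep
    alpha_bar_prev: cumulative noise-schedule product at the previous timestep
    """
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise that estimate to the (lower) noise level of the previous timestep.
    return np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
```

When the noise prediction is exact and `alpha_bar_prev` is 1, the step returns the clean sample, which is a convenient sanity check.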

Findings

1. Produces highly consistent multi-view textures
2. Achieves superior temporal consistency in 4D textures
3. Outperforms existing methods in quality and coherence

Abstract

3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use, playing a crucial role in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for mesh sequences remains labor-intensive for professional artists. On the other hand, while video diffusion models excel at text-driven video generation, they often lack 3D geometry awareness and struggle with achieving multi-view consistent texturing for 3D meshes. In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views…
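The idea of synchronizing views through a shared texture can be sketched as follows: per-view latents are blended into one UV-space texture, then read back into every view so that overlapping views agree. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the flattened pixel layout, the precomputed `texel_index` lookup, and the weighted-average blend are all hypothetical choices made for the example.

```python
import numpy as np

def aggregate_to_uv(view_latents, texel_index, weights, num_texels):
    """Blend per-view latents into one UV texture by weighted averaging.

    Hypothetical layout (an assumption for this sketch):
    view_latents: (V, P, C) latent for each of P pixels in each of V views
    texel_index:  (V, P) index of the texel each pixel sees, -1 for background
    weights:      (V, P) non-negative blend weights (e.g. view-angle cosine)
    """
    channels = view_latents.shape[-1]
    texture = np.zeros((num_texels, channels))
    weight_sum = np.zeros(num_texels)

    visible = texel_index >= 0
    idx = texel_index[visible]
    w = weights[visible]

    # Scatter-add so that several pixels hitting the same texel accumulate.
    np.add.at(texture, idx, view_latents[visible] * w[:, None])
    np.add.at(weight_sum, idx, w)

    covered = weight_sum > 0
    texture[covered] /= weight_sum[covered][:, None]
    return texture, covered

def render_from_uv(texture, texel_index, fallback):
    """Read the shared texture back into one view; background keeps `fallback`."""
    out = fallback.copy()
    visible = texel_index >= 0
    out[visible] = texture[texel_index[visible]]
    return out
```

After aggregation, any two views that see the same texel receive the same latent value, which is the multi-view consistency property the abstract describes. Texels that no view covers are flagged by `covered` and would need separate handling.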

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 4

Strengths

See below.

Weaknesses

See below.

Reviewer 02 · Rating 5 · Confidence 3

Strengths

1. The authors assert that this is the first method developed specifically for 4D scene texturing.
2. The authors introduce a multi-frame consistent texture generation technique, demonstrating improved consistency in results compared to baseline methods.
3. The paper is fluent and well-written, contributing to its readability and overall clarity.

Weaknesses

1. The generated textures do not blend seamlessly with the background, creating a disjointed appearance that resembles separate foreground and background elements stitched together.
2. Despite claims of multi-view consistency, flickering effects are observed across different views, indicating instability in rendering.
3. Some of the compared methods, such as TokenFlow and Text2Video-Zero, do not utilize mesh or depth inputs, making direct comparisons less equitable.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The paper is the first work to perform video generation based on animated mesh sequences, while its UV mapping strategy ensures multi-view consistency. The experimental results show significant advantages compared to some existing works.

Weaknesses

Under the current pipeline, this work has yielded highly effective results. However, the importance of this pipeline should be further clarified, such as by comparing it with pipelines based on 2D poses or textured meshes. The paper should include more comprehensive comparisons to highlight the contribution of the pipeline. For example, is it a reasonable pipeline to first generate textured meshes and then use animated meshes for video generation?

Code & Models

Repositories

ZqlwMatt/Tex4D · PyTorch · Official

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis

Methods: Diffusion