Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models
Jingzhi Bao, Xueting Li, Ming-Hsuan Yang

TL;DR
Tex4D is a novel zero-shot method that combines 3D geometry knowledge with video diffusion models to generate temporally and multi-view consistent 4D textures for mesh sequences from text prompts.
Contribution
It introduces the first approach specifically designed for 4D scene texturing, integrating UV space synchronization and a modified DDIM process for improved consistency.
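For context, the update that a "modified DDIM process" would start from can be sketched as below. This is the standard deterministic DDIM step (eta = 0), not Tex4D's modification, which this summary does not spell out. The function name `ddim_step` and its parameter names are illustrative assumptions.

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One standard deterministic DDIM step (eta = 0).

    NOTE: this is the textbook update, shown only as a baseline; it is
    NOT the modified process used by Tex4D. All names are hypothetical.
    """
    # Estimate the clean sample from the noisy sample and predicted noise.
    x0_pred = (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # Re-noise the estimate to the previous (less noisy) timestep.
    x_prev = np.sqrt(alpha_bar_prev) * x0_pred + np.sqrt(1.0 - alpha_bar_prev) * eps_pred
    return x_prev, x0_pred
```

If the noise prediction were exact, `x0_pred` would recover the clean sample and `x_prev` would lie on the same noising trajectory at the earlier timestep.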
Findings
Produces highly consistent multi-view textures
Achieves superior temporal consistency in 4D textures
Outperforms existing methods in quality and coherence
Abstract
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use, playing a crucial role in movies, games, AR, and VR. However, creating temporally consistent and realistic textures for mesh sequences remains labor-intensive for professional artists. On the other hand, while video diffusion models excel at text-driven video generation, they often lack 3D geometry awareness and struggle with achieving multi-view consistent texturing for 3D meshes. In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures. Given an untextured mesh sequence and a text prompt as inputs, our method enhances multi-view consistency by synchronizing the diffusion process across different views…
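The idea of synchronizing views through a shared texture space can be illustrated with a small sketch. It assumes each view's pixels have already been mapped to texel indices by a rasterizer (with -1 marking background), averages the per-view values that land on the same texel, and reads the shared value back into every view. The helper names `aggregate_to_uv` and `render_from_uv`, the per-pixel weights, and the flat texel indexing are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def aggregate_to_uv(view_latents, texel_ids, weights, num_texels):
    """Weighted average of per-view values into a shared UV texture.

    view_latents: (V, H, W, C) per-view values (hypothetical layout).
    texel_ids:    (V, H, W) integer texel index per pixel, -1 = background.
    weights:      (V, H, W) per-pixel blending weights (an assumption,
                  e.g. based on viewing angle).
    """
    C = view_latents.shape[-1]
    flat_lat = view_latents.reshape(-1, C)
    flat_ids = texel_ids.reshape(-1)
    flat_w = weights.reshape(-1)
    visible = flat_ids >= 0

    tex_sum = np.zeros((num_texels, C))
    w_sum = np.zeros(num_texels)
    # np.add.at accumulates correctly when several pixels hit one texel.
    np.add.at(tex_sum, flat_ids[visible], flat_lat[visible] * flat_w[visible, None])
    np.add.at(w_sum, flat_ids[visible], flat_w[visible])

    covered = w_sum > 0
    tex_sum[covered] /= w_sum[covered, None]
    return tex_sum, covered

def render_from_uv(uv_texture, texel_ids, fallback):
    """Read the shared texture back into each view; background pixels
    keep their values from `fallback`."""
    out = fallback.copy()
    visible = texel_ids >= 0
    out[visible] = uv_texture[texel_ids[visible]]
    return out
```

After a round trip through the shared texture, any two views that see the same surface point hold the same value there, which is the sense in which the views are synchronized in this sketch.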
Peer Reviews
Decision: Submitted to ICLR 2025
1. The authors assert that this is the first method developed specifically for 4D scene texturing.
2. The authors introduce a multi-frame consistent texture generation technique, demonstrating improved consistency in results compared to baseline methods.
3. The paper is fluent and well-written, contributing to its readability and overall clarity.
1. The generated textures do not blend seamlessly with the background, creating a disjointed appearance that resembles separate foreground and background elements stitched together.
2. Despite claims of multi-view consistency, flickering effects are observed across different views, indicating instability in rendering.
3. Some of the compared methods, such as TokenFlow and Text2Video-Zero, do not utilize mesh or depth inputs, making direct comparisons less equitable.
The paper is the first work to perform video generation based on animated mesh sequences, while its UV mapping strategy ensures multi-view consistency. The experimental results show significant advantages compared to some existing works.
Under the current pipeline, this work has yielded highly effective results. However, the importance of this pipeline should be further clarified, such as by comparing it with pipelines based on 2D poses or textured meshes. The paper should include more comprehensive comparisons to highlight the contribution of the pipeline. For example, is it a reasonable pipeline to first generate textured meshes and then use animated meshes for video generation?
Taxonomy
Topics: Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis
Methods: Diffusion
