Diffuman4D: 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

Yudong Jin; Sida Peng; Xuan Wang; Tao Xie; Zhen Xu; Yifan Yang; Yujun Shen; Hujun Bao; Xiaowei Zhou

arXiv:2507.13344·cs.CV·July 18, 2025

TL;DR

This paper introduces Diffuman4D, a 4D diffusion-based method whose sliding iterative denoising process improves the spatio-temporal consistency of high-fidelity human view synthesis from sparse-view videos.

Contribution

The paper proposes a sliding iterative denoising technique for 4D diffusion models that improves spatio-temporal consistency and view synthesis quality while keeping GPU memory usage manageable.

Findings

01 Outperforms existing methods on the DNA-Rendering and ActorsHQ datasets.

02 Produces high-quality, consistent novel-view human videos.

03 Achieves a large effective receptive field through iterative denoising of the latent grid.

Abstract

This paper addresses the challenge of high-fidelity view synthesis of humans with sparse-view videos as input. Previous methods solve the issue of insufficient observation by leveraging 4D diffusion models to generate videos at novel viewpoints. However, the videos generated by these models often lack spatio-temporal consistency, which degrades view synthesis quality. In this paper, we propose a novel sliding iterative denoising process to enhance the spatio-temporal consistency of the 4D diffusion model. Specifically, we define a latent grid in which each latent encodes the image, camera pose, and human pose for a certain viewpoint and timestamp, then alternately denoise the latent grid along the spatial and temporal dimensions with a sliding window, and finally decode the videos at target viewpoints from the corresponding denoised latents. Through the iterative sliding, information…
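
The alternating sliding-window schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the grid shape, window size, stride, step count, and the `toy_denoise` placeholder (which simply pulls each latent toward its window mean) are all assumptions made for illustration. The actual method uses a trained diffusion model conditioned on camera and human pose, which is not reproduced here.

```python
import numpy as np

def toy_denoise(window, strength=0.5):
    """Hypothetical stand-in for one diffusion denoising step.

    Pulls every latent in the window toward the window mean, so that
    latents sharing a window exchange information.
    """
    mean = window.mean(axis=0, keepdims=True)
    return window + strength * (mean - window)

def slide_along_axis(grid, axis, window, stride):
    """Apply toy_denoise to overlapping windows along one grid axis."""
    out = grid.copy()
    moved = np.moveaxis(out, axis, 0)  # a view: writes land in `out`
    n = moved.shape[0]
    for start in range(0, n - window + 1, stride):
        moved[start:start + window] = toy_denoise(moved[start:start + window])
    return out

def sliding_iterative_denoise(grid, num_steps=4, window=4, stride=2):
    """Alternate between the view (spatial) and time (temporal) axes."""
    for step in range(num_steps):
        axis = 0 if step % 2 == 0 else 1  # 0 = viewpoints, 1 = timestamps
        grid = slide_along_axis(grid, axis, window, stride)
    return grid

# Assumed toy grid: 8 viewpoints x 8 timestamps, 16-dimensional latents.
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 8, 16))
denoised = sliding_iterative_denoise(latents)
```

Because consecutive windows overlap and the axis alternates between steps, each latent is influenced by latents well outside any single window, while only one window's worth of latents is processed at a time.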

Code & Models

Models

krahets/Diffuman4D (Hugging Face model · 39 downloads · 7 likes)

Taxonomy

Topics: Advanced Vision and Imaging · Video Surveillance and Tracking Methods · Image and Video Quality Assessment