Human4DiT: 360-degree Human Video Generation with 4D Diffusion Transformer

Ruizhi Shao; Youxin Pang; Zerong Zheng; Jingxiang Sun; Yebin Liu

arXiv:2405.17405 · cs.CV · September 25, 2024

Open Access

TL;DR

This paper introduces Human4DiT, a 4D diffusion transformer framework that generates high-quality, 360-degree human videos from a single image, capturing complex motions and viewpoints with global coherence.

Contribution

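The paper proposes a hierarchical 4D transformer architecture that pairs diffusion transformers, which capture global correlations across viewpoints and time, with CNNs for condition injection, enabling efficient and coherent 360-degree human video synthesis from a single image.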

Findings

1. Successfully generates realistic 360-degree human videos

2. Outperforms previous GAN and diffusion-based methods in motion complexity and viewpoint variation

3. Demonstrates potential for VR and animation applications

Abstract

We present a novel approach for generating 360-degree high-quality, spatio-temporally coherent human videos from a single image. Our framework combines the strengths of diffusion transformers for capturing global correlations across viewpoints and time, and CNNs for accurate condition injection. The core is a hierarchical 4D transformer architecture that factorizes self-attention across views, time steps, and spatial dimensions, enabling efficient modeling of the 4D space. Precise conditioning is achieved by injecting human identity, camera parameters, and temporal signals into the respective transformers. To train this model, we collect a multi-dimensional dataset spanning images, videos, multi-view data, and limited 4D footage, along with a tailored multi-dimensional training strategy. Our approach overcomes the limitations of previous methods based on generative adversarial networks…
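The factorized self-attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the tensor layout `(views, time, tokens, channels)`, the use of projection-free attention, the residual connections, the spatial-then-temporal-then-view ordering, and the helper names `attend_along`, `factorized_4d_block`, and `attention_pairs` are all assumptions made for illustration. The sketch only shows the general idea of attending along one axis at a time instead of over all 4D tokens jointly.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # tokens: (..., L, C). Plain scaled dot-product attention over L.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)
    return softmax(scores, axis=-1) @ tokens


def attend_along(x, axis):
    # Attend only along `axis`; all other leading axes act as batch axes.
    moved = np.moveaxis(x, axis, -2)
    out = self_attention(moved)
    return np.moveaxis(out, -2, axis)


def factorized_4d_block(x):
    # x: (V, T, N, C) = views, time steps, spatial tokens, channels.
    # Hypothetical ordering: spatial, then temporal, then view attention,
    # each with a residual connection.
    x = x + attend_along(x, 2)  # spatial tokens within one view and frame
    x = x + attend_along(x, 1)  # time steps at a fixed view and token
    x = x + attend_along(x, 0)  # views at a fixed time step and token
    return x


def attention_pairs(v, t, n):
    # Query-key pairs for joint attention vs. the factorized scheme.
    total = v * t * n
    return total * total, total * (v + t + n)


x = np.random.default_rng(0).standard_normal((4, 8, 16, 32))
y = factorized_4d_block(x)
full_pairs, factorized_pairs = attention_pairs(4, 8, 16)
```

Under these assumptions, joint attention over all `V*T*N` tokens scales with `(V*T*N)^2` query-key pairs, while the factorized version scales with `V*T*N*(V+T+N)`, which is the efficiency argument behind factorizing across views, time steps, and spatial dimensions.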

Taxonomy

Topics: Advanced Vision and Imaging · Human Motion and Animation · Generative Adversarial Networks and Image Synthesis

Methods: Diffusion