Interspatial Attention for Efficient 4D Human Video Generation

Ruizhi Shao; Yinghao Xu; Yujun Shen; Ceyuan Yang; Yang Zheng; Changan Chen; Yebin Liu; Gordon Wetzstein

arXiv:2505.15800·cs.CV·May 27, 2025

TL;DR

This paper introduces interspatial attention (ISA), a cross-attention mechanism for diffusion transformer (DiT)-based video models, and reports improved quality, consistency, and controllability in 4D human video generation.

Contribution

The paper proposes an interspatial attention mechanism tailored to human video synthesis in diffusion models, aimed at improving motion consistency and identity preservation.

Findings

01. Achieves state-of-the-art results in 4D human video synthesis.

02. Demonstrates high motion consistency and identity preservation.

03. Provides precise control over camera and body poses.

Abstract

Generating photorealistic videos of digital humans in a controllable manner is crucial for a plethora of applications. Existing approaches either build on methods that employ template-based 3D representations or emerging video generation models but suffer from poor quality or limited consistency and identity preservation when generating individual or multiple digital humans. In this paper, we introduce a new interspatial attention (ISA) mechanism as a scalable building block for modern diffusion transformer (DiT)-based video generation models. ISA is a new type of cross attention that uses relative positional encodings tailored for the generation of human videos. Leveraging a custom-developed video variational autoencoder, we train a latent ISA-based diffusion model on a large corpus of video data. Our model achieves state-of-the-art performance for 4D human video synthesis,…
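
To make the phrase "cross attention that uses relative positional encodings" more concrete, here is a minimal NumPy sketch of a generic cross-attention layer with an additive bias that depends only on the difference between query and key positions. It is not the paper's ISA formulation, which the abstract does not specify. The function name `cross_attention_relative`, the squared-distance bias, and the `bias_scale` parameter are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_relative(q, k, v, q_pos, k_pos, bias_scale=1.0):
    """Cross attention with a relative positional bias (illustrative only).

    q:     (Nq, d)  query features, e.g. video tokens
    k:     (Nk, d)  key features, e.g. tokens from a conditioning signal
    v:     (Nk, dv) value features
    q_pos: (Nq, p)  query positions
    k_pos: (Nk, p)  key positions
    The bias is a hypothetical choice: the negative squared distance
    between positions, so nearby tokens attend to each other more.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                    # (Nq, Nk)
    rel = q_pos[:, None, :] - k_pos[None, :, :]      # (Nq, Nk, p)
    bias = -bias_scale * np.sum(rel ** 2, axis=-1)   # depends only on q_pos - k_pos
    weights = softmax(logits + bias, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 5))
q_pos = rng.normal(size=(4, 3))
k_pos = rng.normal(size=(6, 3))

out, weights = cross_attention_relative(q, k, v, q_pos, k_pos)
# Shifting every position by the same offset leaves the output unchanged,
# which is the defining property of a relative encoding.
shift = np.array([2.0, -1.0, 0.5])
out_shifted, _ = cross_attention_relative(q, k, v, q_pos + shift, k_pos + shift)
```

Because the bias is a function of `q_pos - k_pos` alone, the attention pattern is invariant to a global translation of the positions, in contrast to absolute positional encodings.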

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need · Diffusion