Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Mingdeng Cao; Chong Mou; Ziyang Yuan; Xintao Wang; Zhaoyang Zhang; Ying Shan; Yinqiang Zheng

arXiv:2412.14531·cs.CV·December 20, 2024

TL;DR

This paper introduces a unified diffusion-based approach to consistent human image and video synthesis. It preserves the reference appearance while following the target pose, and it mitigates the domain gap between reference and target by combining a spatially-conditioned inpainting framework with a causal feature interaction design.

Contribution

The authors propose a novel spatially-conditioned inpainting method with a causal feature interaction framework, enabling unified, flexible, and efficient human image and video generation without additional per-instance fine-tuning.

Findings

01. Achieves consistent appearance and pose in generated images and videos.

02. Demonstrates strong generalization to unseen identities and poses.

03. Outperforms existing methods in quality and consistency.

Abstract

Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework,…
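The spatially-conditioned inpainting framing described in the abstract can be illustrated with a minimal sketch. It assumes, hypothetically, that the reference image and a blank target region are placed side by side on one canvas, with a binary mask marking the target half as the area to inpaint. The function name `build_inpainting_input`, the concatenation along the width axis, and the zero-filled placeholder are illustrative assumptions, not details confirmed by the paper summary; the denoising network itself is not shown.

```python
import numpy as np


def build_inpainting_input(reference: np.ndarray):
    """Build a (canvas, mask) pair for a side-by-side inpainting setup.

    Assumption: the reference occupies the left half of the canvas and the
    target to be generated occupies the right half. The real method may use
    a different layout or operate on latents rather than pixels.
    """
    height, width, _ = reference.shape

    # Blank region that the denoising network would be asked to fill in.
    target_placeholder = np.zeros_like(reference)

    # Reference and target share one canvas, so a single network sees both.
    canvas = np.concatenate([reference, target_placeholder], axis=1)

    # Mask convention (assumed): 1 = region to inpaint, 0 = region to keep.
    mask = np.zeros((height, 2 * width), dtype=np.float32)
    mask[:, width:] = 1.0
    return canvas, mask


# Hypothetical 64x48 RGB reference image with values in [0, 1).
reference = np.random.default_rng(0).random((64, 48, 3)).astype(np.float32)
canvas, mask = build_inpainting_input(reference)
```

In this layout the reference pixels are kept unchanged and only the masked half is generated, which is one way a single denoising network could process both reference and target in the same feature space.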

Code & Models

Repositories

ljzycmd/scd (Official)

Taxonomy

Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis

Methods: Balanced Selection · Diffusion · Inpainting