Consistent Human Image and Video Generation with Spatially Conditioned Diffusion
Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng

TL;DR
This paper introduces a unified diffusion-based approach to consistent human image and video synthesis. It casts the task as spatially-conditioned inpainting inside a single denoising network, paired with a causal feature interaction design, to preserve the reference appearance and follow the target pose while mitigating the domain gap between reference and target features.
Contribution
The authors propose a novel spatially-conditioned inpainting method with a causal feature interaction framework, enabling unified, flexible, and efficient human image and video generation without additional per-instance fine-tuning.
Findings
Achieves consistent appearance and pose in generated images and videos.
Demonstrates strong generalization to unseen identities and poses.
Outperforms existing methods in quality and consistency.
Abstract
Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework,…
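The inpainting framing in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names, the side-by-side layout, and the toy list-of-rows image format are all hypothetical. The abstract is also truncated before it defines the causal feature interaction, so the attention rule below (reference tokens attend only to reference tokens, target tokens attend to both) is an assumption about what "causal" might mean here.

```python
def build_inpainting_canvas(reference, fill=0.0):
    """Concatenate a reference image with a blank target of the same size.

    `reference` is a list of rows (H x W). Returns (canvas, mask), both
    H x 2W; mask is 1 where the denoiser is expected to inpaint.
    Hypothetical layout: reference on the left, target on the right.
    """
    width = len(reference[0])
    canvas = [list(row) + [fill] * width for row in reference]
    mask = [[0] * width + [1] * width for _ in reference]
    return canvas, mask


def causal_interaction_mask(n_ref, n_tgt):
    """allowed[q][k] is True when query token q may attend to key token k.

    Assumed rule (not confirmed by the truncated abstract): reference
    tokens (indices < n_ref) see only reference tokens, while target
    tokens see both reference and target tokens.
    """
    total = n_ref + n_tgt
    return [[k < n_ref or q >= n_ref for k in range(total)]
            for q in range(total)]


if __name__ == "__main__":
    canvas, mask = build_inpainting_canvas([[1, 2], [3, 4]])
    print(canvas)
    print(mask)
    print(causal_interaction_mask(2, 1))
```

Under these assumptions, a single network processes reference and target in one pass, so both sets of features come from the same denoiser, and the one-way mask keeps the reference features from being altered by the noisy target.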
Taxonomy
Topics: Advanced Vision and Imaging · Computer Graphics and Visualization Techniques · Generative Adversarial Networks and Image Synthesis
Methods: Balanced Selection · Diffusion · Inpainting
