TL;DR
HumANDiff is an articulated-noise diffusion framework for realistic, motion-consistent human video generation. It improves motion control and visual fidelity without altering existing diffusion models.
Contribution
It proposes a novel articulated noise sampling method, joint appearance-motion learning, and geometric motion consistency to improve human video synthesis.
Findings
Achieves state-of-the-art motion consistency and visual fidelity.
Enables intrinsic motion control during inference.
Supports diverse clothing styles and complex motions.
Abstract
Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to faithfully capture the dynamics and physics of human motion. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances human motion control with three key designs: 1) Articulated motion-consistent noise sampling, which correlates the spatiotemporal distribution of latent noise and replaces unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning, which enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and the corresponding physical motions from the articulated noise. It enables…
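The first design, noise sampled on the body surface rather than independently per pixel, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical per-frame map `vertex_index_maps` that gives, for each latent pixel, the index of the visible body-template vertex (or -1 for background). How such a map is rendered, and the function name and parameters, are assumptions made for illustration.

```python
import numpy as np

def sample_articulated_noise(vertex_index_maps, num_vertices, channels=4, seed=0):
    """Sketch of surface-anchored noise (hypothetical helper, not the paper's code).

    vertex_index_maps: int array of shape (T, H, W). Each entry is the index of
        the template vertex visible at that pixel, or -1 for background.
    Returns a float array of shape (T, H, W, channels).
    """
    rng = np.random.default_rng(seed)
    T, H, W = vertex_index_maps.shape

    # One Gaussian sample per template vertex, shared by every frame.
    surface_noise = rng.standard_normal((num_vertices, channels))

    # Background pixels keep ordinary independent Gaussian noise.
    noise = rng.standard_normal((T, H, W, channels))

    # Body pixels look up the noise attached to their surface vertex, so the
    # same body point carries the same noise wherever it moves in the frame.
    body = vertex_index_maps >= 0
    noise[body] = surface_noise[vertex_index_maps[body]]
    return noise

# Usage: vertex 3 sits at pixel (1, 1) in frame 0 and moves to (2, 2) in frame 1.
maps = -np.ones((2, 4, 4), dtype=int)
maps[0, 1, 1] = 3
maps[1, 2, 2] = 3
noise = sample_articulated_noise(maps, num_vertices=5)
```

In this sketch the noise follows the body point across frames, which is the temporal correlation the abstract describes. One simplification to be aware of: if several pixels in a frame map to the same vertex, they share a noise value, so the per-frame noise is no longer strictly independent across pixels. A real system would need to handle that case.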
