TL;DR
ReImagine takes an image-first approach to controllable human video generation: it learns high-quality human appearance through image generation, adds pose and viewpoint control, and applies a temporal refinement stage. The code, dataset, and auxiliary models are publicly released.
Contribution
The paper presents a pipeline that decouples appearance modeling from temporal consistency, combining a pretrained image backbone with a pretrained video diffusion model for high-quality, controllable, and temporally consistent human video synthesis.
Findings
Produces high-quality, temporally consistent videos under diverse poses and viewpoints.
Decouples appearance modeling from temporal consistency for better controllability.
Includes publicly available code, dataset, and auxiliary models.
Abstract
Human video generation remains challenging due to the difficulty of jointly modeling human appearance, motion, and camera viewpoint under limited multi-view data. Existing methods often address these factors separately, resulting in limited controllability or reduced visual quality. We revisit this problem from an image-first perspective, where high-quality human appearance is learned via image generation and used as a prior for video synthesis, decoupling appearance modeling from temporal consistency. We propose a pose- and viewpoint-controllable pipeline that combines a pretrained image backbone with SMPL-X-based motion guidance, together with a training-free temporal refinement stage based on a pretrained video diffusion model. Our method produces high-quality, temporally consistent videos under diverse poses and viewpoints. We also release a canonical human dataset and an auxiliary…
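The two-stage structure described in the abstract (per-frame appearance generation under pose control, followed by a separate temporal refinement pass) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `generate_frame`, `temporal_refine`, `flicker`, and `run_pipeline` are hypothetical names, the "image generator" is a noisy array stand-in, poses are reduced to one scalar per frame, and the refinement step is a plain moving average rather than the pretrained video diffusion model the paper uses. It shows only the decoupling idea, not the authors' implementation.

```python
import numpy as np


def generate_frame(appearance, pose, rng):
    """Hypothetical stand-in for a pose-conditioned image generator.

    Each frame is produced independently, so it carries its own noise.
    """
    noise = rng.normal(scale=0.1, size=appearance.shape)
    return appearance + pose + noise


def temporal_refine(frames, window=3):
    """Stand-in for stage two: a moving average over the time axis.

    This is NOT video diffusion; it only marks where a temporal
    refinement step would sit in an image-first pipeline.
    """
    pad = window // 2
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack(
        [padded[t:t + window].mean(axis=0) for t in range(frames.shape[0])]
    )


def flicker(frames):
    """Mean absolute difference between consecutive frames."""
    return float(np.abs(np.diff(frames, axis=0)).mean())


def run_pipeline(num_frames=16, size=8, seed=0):
    rng = np.random.default_rng(seed)
    appearance = rng.uniform(size=(size, size))   # toy appearance prior
    poses = np.linspace(0.0, 1.0, num_frames)     # toy scalar pose per frame
    # Stage 1: generate every frame independently from appearance and pose.
    frames = np.stack([generate_frame(appearance, p, rng) for p in poses])
    # Stage 2: refine across time without touching the stage-1 generator.
    refined = temporal_refine(frames)
    return frames, refined


if __name__ == "__main__":
    frames, refined = run_pipeline()
    print("flicker before refinement:", flicker(frames))
    print("flicker after refinement: ", flicker(refined))
```

In this toy setting the second stage reduces frame-to-frame variation without any change to the per-frame generator, which is the sense in which appearance modeling and temporal consistency are handled separately.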
