Human Video Generation from a Single Image with 3D Pose and View Control

Tiantian Wang; Chun-Han Yao; Tao Hu; Mallikarjun Byrasandra Ramalinga Reddy; Ming-Hsuan Yang; Varun Jampani

arXiv:2602.21188 · cs.CV · February 25, 2026

TL;DR

This paper introduces HVG, a diffusion-based model that generates high-quality, multi-view, temporally coherent human videos from a single image with 3D pose and view control. It targets a key difficulty of image-to-video synthesis for humans: inferring view-consistent, motion-dependent clothing wrinkles from a single input image.

Contribution

HVG is a latent video diffusion model built on three components: articulated pose modulation, view and temporal alignment, and progressive spatio-temporal sampling. Together they are intended to improve the quality and spatiotemporal coherence of synthesized human videos.

Findings

01. HVG outperforms existing methods in quality and consistency.
02. It generates multi-view, long-duration human videos.
03. It maintains view and temporal coherence across frames.

Abstract

Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures…
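The abstract does not spell out how the dual-dimensional bone map is constructed. As a purely illustrative sketch under our own assumptions (a pinhole camera with intrinsics K and extrinsics R, t, and a two-channel map holding a bone mask plus the depth of the nearest bone), the snippet below shows one generic way to turn 3D joints and a chosen viewpoint into an image-space pose condition, using a z-buffer so that the nearer bone wins where bones overlap, which is one simple way 3D information can disambiguate self-occlusion. The functions `project` and `bone_map` are hypothetical and are not HVG's implementation.

```python
import numpy as np


def project(joints_3d, K, R, t):
    """Pinhole projection of (J, 3) world-space joints.

    Returns (J, 2) pixel coordinates and (J,) camera-space depths.
    """
    cam = joints_3d @ R.T + t          # world -> camera frame (assumed convention)
    depth = cam[:, 2]
    hom = cam @ K.T                    # homogeneous pixel coordinates
    uv = hom[:, :2] / hom[:, 2:3]      # perspective divide
    return uv, depth


def bone_map(joints_3d, bones, K, R, t, H, W, samples=64):
    """Rasterize bones into a (2, H, W) map (hypothetical layout).

    Channel 0 is a binary bone mask; channel 1 stores the depth of the
    nearest bone at each pixel, so a z-buffer decides which bone is
    visible where bones overlap in this view.
    """
    uv, depth = project(joints_3d, K, R, t)
    out = np.zeros((2, H, W), dtype=np.float32)
    zbuf = np.full((H, W), np.inf)
    for a, b in bones:                 # each bone is a (parent, child) index pair
        for s in np.linspace(0.0, 1.0, samples):
            p = (1.0 - s) * uv[a] + s * uv[b]
            # Linear depth along the projected segment: an approximation,
            # not perspective-correct interpolation.
            z = (1.0 - s) * depth[a] + s * depth[b]
            x, y = int(round(p[0])), int(round(p[1]))
            if 0 <= x < W and 0 <= y < H and z < zbuf[y, x]:
                zbuf[y, x] = z
                out[0, y, x] = 1.0
                out[1, y, x] = z
    return out
```

For example, with K = [[100, 0, 32], [0, 100, 32], [0, 0, 1]], R the identity and t = (0, 0, 5), a bone from (0, 0, 0) to (0, 1, 0) becomes a vertical segment from pixel (32, 32) to (32, 52) at depth 5. Re-rendering the same joints with a different R and t is what provides view control in this toy setup.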

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · 3D Shape Modeling and Analysis