Pippo: High-Resolution Multi-View Humans from a Single Image
Yash Kant, Ethan Weber, Jin Kyu Kim, Rawal Khirodkar, Zhaoen Su, Julieta Martinez, Igor Gilitschenski, Shunsuke Saito, and Timur Bagautdinov

TL;DR
Pippo is a generative model that produces 1K-resolution, dense turnaround videos of a person from a single photo, with no additional inputs. It combines multi-stage training with an inference-time attention biasing technique that lets it generate more views than it was trained on.
Contribution
Pippo introduces a multi-view diffusion transformer that is pre-trained on 3B human images and then mid-trained and post-trained on studio-captured multi-view data, with training and inference techniques aimed at 3D-consistent, high-resolution human generation from a single image.
Findings
Outperforms existing methods in multi-view human generation.
Generates more than 5 times as many views at inference as were seen during training.
Achieves 1K resolution dense turnaround videos from a single photo.
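Generating more views than were seen in training means each attention layer runs over more tokens than it was trained on, and softmax attention tends to spread out (higher entropy) as the token count grows. The sketch below illustrates that general effect and one way to counter it by scaling the logits. It is not the paper's attention biasing formula: the `length_scale` heuristic (a ratio of log token counts), the function names, and the synthetic logits are all assumptions made for illustration.

```python
import math


def softmax(logits, scale=1.0):
    """Numerically stable softmax over a list of logits, with an optional logit scale."""
    scaled = [x * scale for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def length_scale(n_tokens, n_train_tokens):
    """Hypothetical heuristic: sharpen attention when there are more tokens than in training.

    This ratio of logs is an assumption for illustration, not the paper's formula.
    """
    return max(1.0, math.log(n_tokens) / math.log(n_train_tokens))


# Synthetic logits standing in for one query's attention scores.
n_train, n_infer = 100, 500  # 5x more tokens at inference than in training
logits = [math.sin(i) for i in range(n_infer)]

scale = length_scale(n_infer, n_train)
plain_entropy = entropy(softmax(logits))
biased_entropy = entropy(softmax(logits, scale=scale))
```

Scaling the logits by a factor above 1 always lowers the softmax entropy when the logits are not all equal, so `biased_entropy` is smaller than `plain_entropy` here. How strongly to scale is a design choice that the paper tunes in its own way.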
Abstract
We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs - e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low-resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high-resolution and use pixel-aligned controls (e.g., Spatial anchor and Plucker rays) to enable 3D consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater…
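The abstract names Plucker rays as one of the pixel-aligned controls used in post-training. As a minimal sketch of what such a control encodes, the code below computes the standard Plücker coordinates of a camera ray, the unit direction `d` together with the moment `o × d`, for one pixel of a pinhole camera. The function names, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the camera-to-world rotation argument are illustrative assumptions, not Pippo's actual interface.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def pixel_direction(u, v, fx, fy, cx, cy, cam_to_world_rot):
    """World-space unit direction of the ray through pixel (u, v) of a pinhole camera."""
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d_world = tuple(
        sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3)) for i in range(3)
    )
    return normalize(d_world)


def plucker_ray(origin, direction):
    """6D Plücker coordinates (d, o x d) of the ray from `origin` along `direction`."""
    d = normalize(direction)
    moment = cross(origin, d)
    return d + moment


# Example: a camera at (1, 0, 0) with identity rotation, ray through the principal point.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
d = pixel_direction(64.0, 64.0, 100.0, 100.0, 64.0, 64.0, identity)
ray = plucker_ray((1.0, 0.0, 0.0), d)
```

Evaluating this for every pixel of a target view yields a 6-channel map aligned with the image grid, which is the sense in which such a control is pixel-aligned. The moment is always perpendicular to the direction, and the encoding does not depend on which point along the ray is taken as the origin.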
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need · Diffusion
