Pippo: High-Resolution Multi-View Humans from a Single Image

Yash Kant; Ethan Weber; Jin Kyu Kim; Rawal Khirodkar; Su Zhaoen; Julieta Martinez; Igor Gilitschenski; Shunsuke Saito; Timur Bagautdinov

arXiv:2502.07785 · cs.CV · February 12, 2025

TL;DR

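Pippo is a generative model that produces 1K-resolution dense turnaround videos of a person from a single photo, with no additional inputs such as a fitted parametric model or camera parameters. It combines multi-stage training with an inference-time attention biasing technique that lets it generate more views than it saw during training.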

Contribution

The paper introduces a multi-view diffusion transformer that is pre-trained on 3B uncaptioned human images, then mid-trained and post-trained on studio-captured humans, together with training and inference techniques for 3D-consistent, high-resolution human generation from a single image.

Findings

01

Outperforms existing methods in multi-view human generation.

02

Generates over 5 times as many views at inference as were seen during training.

03

Produces 1K-resolution dense turnaround videos from a single photo.
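
Finding 02 concerns generating more views at inference than the model saw in training, which this page credits to an inference-time attention biasing technique without giving its formula. As a rough illustration of why some correction is needed, the sketch below shows that softmax attention spreads out (its entropy rises) as the number of keys grows, and that rescaling the logits pulls it back. The scale factor `log(n) / log(n_train)` and the names `attention_weights` and `n_train` are assumptions chosen for illustration, not the rule used by Pippo.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def attention_weights(scores, n_train=None):
    """Attention weights for one query, optionally sharpened for long inputs.

    ASSUMPTION: gamma = log(n) / log(n_train) is an illustrative heuristic,
    not the formula from the paper.
    """
    n = len(scores)
    gamma = 1.0
    if n_train is not None and n_train > 1 and n > n_train:
        gamma = math.log(n) / math.log(n_train)
    return softmax([gamma * s for s in scores])


# Toy scores: one query attending to 16 keys ("training" length)
# and to 96 keys ("inference" length). The lengths are arbitrary.
scores_train = [math.sin(i) for i in range(16)]
scores_infer = [math.sin(i) for i in range(96)]

h_train = entropy(attention_weights(scores_train))
h_plain = entropy(attention_weights(scores_infer))
h_biased = entropy(attention_weights(scores_infer, n_train=16))
print(f"train={h_train:.3f} plain={h_plain:.3f} biased={h_biased:.3f}")
```

With more keys, the unscaled weights are more diffuse than at the training length; multiplying the logits by a factor greater than 1 lowers the entropy again. Any real system would tune such a factor empirically.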

Abstract

We present Pippo, a generative model capable of producing 1K resolution dense turnaround videos of a person from a single casually clicked photo. Pippo is a multi-view diffusion transformer and does not require any additional inputs, e.g., a fitted parametric model or camera parameters of the input image. We pre-train Pippo on 3B human images without captions, and conduct multi-view mid-training and post-training on studio-captured humans. During mid-training, to quickly absorb the studio dataset, we denoise several (up to 48) views at low resolution, and encode target cameras coarsely using a shallow MLP. During post-training, we denoise fewer views at high resolution and use pixel-aligned controls (e.g., spatial anchor and Plücker rays) to enable 3D-consistent generations. At inference, we propose an attention biasing technique that allows Pippo to simultaneously generate greater…
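
The abstract names Plücker rays as one of the pixel-aligned controls used in post-training. A Plücker ray encodes a pixel's viewing ray as a 6-vector: the unit direction d and the moment o × d, where o is the camera center. The sketch below computes this for a pinhole camera. The function names, the (d, o × d) ordering, the +z-forward camera convention, and the half-pixel offset are assumptions for illustration and may differ from the paper's implementation.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(u, v, fx, fy, cx, cy, rot_c2w, cam_center):
    """Return (d, o x d) for the ray through pixel coordinate (u, v).

    rot_c2w is a 3x3 camera-to-world rotation; cam_center is o in world space.
    ASSUMPTION: pinhole intrinsics with the camera looking down +z.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    d = [sum(rot_c2w[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(x * x for x in d))
    d = tuple(x / norm for x in d)
    moment = cross(cam_center, d)
    return d + moment


def plucker_map(width, height, fx, fy, cx, cy, rot_c2w, cam_center):
    """Per-pixel Plücker rays as a height x width grid of 6-tuples."""
    return [
        [
            # +0.5 samples the pixel center (an assumed convention).
            plucker_ray(x + 0.5, y + 0.5, fx, fy, cx, cy, rot_c2w, cam_center)
            for x in range(width)
        ]
        for y in range(height)
    ]


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
rays = plucker_map(4, 4, fx=2.0, fy=2.0, cx=2.0, cy=2.0,
                   rot_c2w=IDENTITY, cam_center=(1.0, 0.0, 0.0))
print(len(rays), len(rays[0]), len(rays[0][0]))
```

Because the moment o × d is unchanged when o slides along the ray, the 6-vector identifies the ray itself rather than a particular point on it, which is one reason this encoding is popular as a per-pixel camera signal.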

Taxonomy

Topics: Video Surveillance and Tracking Methods · Advanced Vision and Imaging · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need · Diffusion