GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Gwanghyun Kim; Xueting Li; Ye Yuan; Koki Nagano; Tianye Li; Jan Kautz; Se Young Chun; Umar Iqbal

arXiv:2505.23085·cs.CV·May 30, 2025

TL;DR

GeoMan uses an image-to-video diffusion model to estimate accurate, temporally consistent 3D human geometry (depth and normals) from monocular videos, addressing the scarcity of 4D training data and the difficulty of estimating human size.

Contribution

GeoMan pairs an image-based model, which estimates depth and normals for the first frame, with a video diffusion model conditioned on that estimate, enabling high-quality, temporally consistent 3D human geometry estimation with limited 4D training data.

Findings

01. Achieves state-of-the-art results in 3D human geometry estimation.

02. Improves temporal consistency and generalization over existing methods.

03. Effectively estimates human size using root-relative depth representation.
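The idea behind a root-relative depth representation can be illustrated with a minimal sketch. It assumes that "root-relative" means subtracting a single scalar root depth (for example, the depth of a pelvis joint) from every human pixel; the exact formulation used by GeoMan is not given on this page, so the function names `to_root_relative` and `to_metric`, the mask convention, and the toy values below are all hypothetical.

```python
import numpy as np

def to_root_relative(depth, mask, root_depth):
    """Express human pixels as offsets from an assumed scalar root depth.

    Background pixels (mask == False) are set to 0. This is an illustrative
    convention, not necessarily the one used in the paper.
    """
    return np.where(mask, depth - root_depth, 0.0)

def to_metric(rel_depth, mask, root_depth):
    """Recover metric depth by adding the root depth back to human pixels."""
    return np.where(mask, rel_depth + root_depth, 0.0)

# Toy 2x2 depth map in meters; 0.0 marks background (assumption).
depth = np.array([[0.0, 2.9],
                  [3.0, 3.2]])
mask = depth > 0
root_depth = 3.0  # hypothetical depth of the root joint

rel = to_root_relative(depth, mask, root_depth)
restored = to_metric(rel, mask, root_depth)
```

In this sketch the relative map stays within a narrow range around zero regardless of how far the person is from the camera, while the scalar root depth carries the absolute distance needed to recover metric scale.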

Abstract

Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the…

Taxonomy

Methods Focus · Diffusion