UniSH: Unifying Scene and Human Reconstruction in a Feed-Forward Pass

Mengfei Li; Peng Li; Zheng Zhang; Jiahao Lu; Chengfeng Zhao; Wei Xue; Qifeng Liu; Sida Peng; Wenxiao Zhang; Wenhan Luo; Yuan Liu; Yike Guo

arXiv:2601.01222·cs.CV·January 6, 2026

Open Access

TL;DR

UniSH is a feed-forward framework that jointly reconstructs the 3D scene and the humans in it at metric scale. It uses unlabeled in-the-wild data during training to improve generalization and geometric fidelity.

Contribution

The paper proposes a training paradigm that combines distillation from an expert depth model with a two-stage supervision scheme (synthetic data first, then real data) to improve 3D scene and human reconstruction when annotated real-world data is scarce.

Findings

01. Achieves state-of-the-art results in human-centric scene reconstruction.

02. Outperforms existing methods in global human motion estimation.

03. Demonstrates strong generalization to in-the-wild videos.

Abstract

We present UniSH, a unified, feed-forward framework for joint metric-scale 3D scene and human reconstruction. A key challenge in this domain is the scarcity of large-scale, annotated real-world data, forcing a reliance on synthetic datasets. This reliance introduces a significant sim-to-real domain gap, leading to poor generalization, low-fidelity human geometry, and poor alignment on in-the-wild videos. To address this, we propose an innovative training paradigm that effectively leverages unlabeled in-the-wild data. Our framework bridges strong, disparate priors from scene reconstruction and HMR, and is trained with two core components: (1) a robust distillation strategy to refine human surface details by distilling high-frequency details from an expert depth model, and (2) a two-stage supervision scheme, which first learns coarse localization on synthetic data, then fine-tunes on real…
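
To make the idea of distilling depth detail from an expert model more concrete, here is a minimal sketch of one common way such a loss can be set up. It is not the loss used in the paper, which the abstract does not specify. The function names, the flat-list depth representation, and the choice of a least-squares scale-and-shift alignment followed by a masked mean absolute error are all assumptions made for illustration. The alignment maps the teacher's depth onto the student's scale, so that only relative structure inside the human mask is penalized and the student's metric scale is left alone.

```python
def align_scale_shift(source, target):
    """Least-squares scale s and shift t so that s * source + t ~ target."""
    n = len(source)
    if n == 0 or n != len(target):
        raise ValueError("inputs must be non-empty and equally long")
    mean_s = sum(source) / n
    mean_t = sum(target) / n
    var_s = sum((x - mean_s) ** 2 for x in source)
    if var_s == 0.0:
        # Constant source: only a shift can be fitted.
        return 0.0, mean_t
    cov = sum((x - mean_s) * (y - mean_t) for x, y in zip(source, target))
    s = cov / var_s
    return s, mean_t - s * mean_s


def masked_distillation_loss(student_depth, teacher_depth, human_mask):
    """Hypothetical distillation loss (an assumption, not the paper's).

    Keeps only pixels where human_mask is True, aligns the teacher depth
    to the student depth with a scale and shift, and returns the mean
    absolute residual.
    """
    student = [d for d, m in zip(student_depth, human_mask) if m]
    teacher = [d for d, m in zip(teacher_depth, human_mask) if m]
    s, t = align_scale_shift(teacher, student)
    residuals = (abs(s * x + t - y) for x, y in zip(teacher, student))
    return sum(residuals) / len(student)
```

In this sketch, a student prediction that differs from the teacher only by a global scale and offset incurs no penalty, while pixels outside the mask are ignored entirely. A real training setup would operate on image tensors and would likely add a gradient or detail term; those choices are outside what the abstract states.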

Taxonomy

Topics: Human Pose and Action Recognition · 3D Shape Modeling and Analysis · Human Motion and Animation