Joint Optimization for 4D Human-Scene Reconstruction in the Wild
Zhizheng Liu, Joe Lin, Wayne Wu, Bolei Zhou

TL;DR
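This paper introduces JOSH, an optimization-based method for 4D human-scene reconstruction from monocular videos in the wild. By jointly optimizing the scene, the camera poses, and the human motion, it improves the accuracy of both motion estimation and scene reconstruction. The paper also introduces JOSH3R, a more efficient model trained with pseudo-labels.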
Contribution
The paper presents a novel joint optimization framework for 4D human-scene reconstruction in the wild and introduces JOSH3R, a more efficient model trained with pseudo-labels.
Findings
JOSH outperforms previous methods in human motion estimation and scene reconstruction.
JOSH3R achieves higher accuracy than optimization-free methods.
Joint optimization improves reconstruction quality.
Abstract
Reconstructing human motion and its surrounding environment is crucial for understanding human-scene interaction and predicting human movements in the scene. While much progress has been made in capturing human-scene interaction in constrained environments, those prior methods can hardly reconstruct the natural and diverse human motion and scene context from web videos. In this work, we propose JOSH, a novel optimization-based method for 4D human-scene reconstruction in the wild from monocular videos. JOSH uses techniques in both dense scene reconstruction and human mesh recovery as initialization, and then it leverages the human-scene contact constraints to jointly optimize the scene, the camera poses, and the human motion. Experiment results show JOSH achieves better results on both global human motion estimation and dense scene reconstruction by joint optimization of scene geometry…
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is well written, organized and easy to follow. The ideas are presented clearly and the introduction section particularly is an insightful read. - Joint human-scene modeling: The problem of jointly modeling scenes and humans is ambitious and forward-looking. The proposed approach, JOSH, is promising, shows strong in-the-wild results, and should encourage the community to design integrated human–scene methods. - Simplicity of the approach: The core is simple—and that is its strength
- Limited Technical Novelty (without JOSH3R): I am surprised that JOSH3R's details (the learning-based module built on top of the data collected using JOSH) are relegated to the appendix. Without JOSH3R as a core contribution, JOSH remains an optimization-based method for joint human–scene reconstruction; it can at best provide pseudo labels (although viable at scale). Standalone, it offers limited new technical insight. The optimization procedures largely mirror those in Hi4D (CVPR 2023), Ego-Exo
- The proposed method achieves strong quantitative performance across the board. The paper provides comparisons with multiple baselines on a few benchmarks and JOSH outperforms previous works in the majority of cases. - Integrating signals for human contact is not easy, because these estimates are very noisy. The paper does this integration carefully, so that JOSH can benefit from the noisy predictions of the off-the-shelf systems. - There is a helpful ablation that considers different aspects of
- The proposed approach is relatively straightforward for the most part. This type of optimization is more traditional (e.g., Rempe et al., ICCV 2021 and Ye et al., CVPR 2023), so integrating better initial estimates from off-the-shelf models or refining the optimization objectives will often lead to better results. - The use of human-scene contact relies on the contact being visible in the video. Such contact is often available in the simpler benchmark datasets used for evaluation but less common
1. Clean contact formulation. Explicit vertex–scene distance with visibility + depth-prior gating + a temporal static contact term that mathematically targets sliding, which is practical and easy to implement without a simulator. 2. Focal-length optimization tied to root depth, addressing metric-scale failures when intrinsics are unknown, high leverage for web videos. 3. General wrapper. Boosts multiple scene/human initializers; evaluation spans human, scene, and physics plausibility metrics.
1. No force/stability reasoning. JOSH’s contacts are geometric (distance/static), with no force or friction-cone reasoning; expect residual artifacts under occlusion or weak depth priors (e.g., hand-on-sofa, compliant supports). 2. Contact detection & masking sensitivity. JOSH assumes reliable contact vertices and segmentation; occlusion/noisy masks may yield wrong correspondences, and the method’s robustness to such errors isn’t deeply quantified. 3. Dynamic scenes / non-static supports. Geometric matching
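The two geometric contact terms that the reviews describe (a vertex-to-scene distance and a temporal static-contact term that penalizes sliding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `contact_scene_loss` and `static_contact_loss`, the brute-force nearest-neighbor search, and the unweighted means are all assumptions made for illustration, and the visibility and depth-prior gating mentioned in the review is omitted.

```python
import numpy as np


def contact_scene_loss(contact_verts, scene_points):
    """Mean distance from each contact vertex to its nearest scene point.

    Hypothetical helper: contact_verts is (C, 3), scene_points is (P, 3).
    """
    # Pairwise differences, shape (C, P, 3), then distances, shape (C, P).
    diffs = contact_verts[:, None, :] - scene_points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Nearest scene point per contact vertex, averaged over vertices.
    return float(dists.min(axis=1).mean())


def static_contact_loss(verts_t, verts_t1, in_contact):
    """Mean displacement between two frames for vertices flagged as in contact.

    Hypothetical helper: verts_t and verts_t1 are (V, 3); in_contact is a
    boolean mask of shape (V,). Non-zero values indicate sliding.
    """
    if not in_contact.any():
        return 0.0
    disp = np.linalg.norm(verts_t1 - verts_t, axis=-1)
    return float(disp[in_contact].mean())


# Toy scene: a 3x3 grid of points on the ground plane z = 0.
xs, ys = np.meshgrid(np.arange(3.0), np.arange(3.0))
scene = np.stack([xs.ravel(), ys.ravel(), np.zeros(9)], axis=1)

# Two foot vertices floating 5 cm above grid points.
feet = np.array([[0.0, 0.0, 0.05], [1.0, 0.0, 0.05]])
floating = contact_scene_loss(feet, scene)

# Between frames, the first vertex slides 10 cm along x; the second stays put.
feet_next = feet + np.array([[0.1, 0.0, 0.0], [0.0, 0.0, 0.0]])
sliding = static_contact_loss(feet, feet_next, np.array([True, True]))
```

In a joint optimization of the kind the abstract describes, terms like these would be minimized together with other objectives over the scene, the camera poses, and the human motion. Because both terms are purely geometric, they involve no forces or friction, which is the limitation the last review raises.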
Taxonomy
Topics: 3D Shape Modeling and Analysis · Remote Sensing and LiDAR Applications · Advanced Vision and Imaging
