Incremental Joint Learning of Depth, Pose and Implicit Scene Representation on Monocular Camera in Large-scale Scenes

Tianchen Deng; Nailin Wang; Chongdi Wang; Shenghai Yuan; Jingchuan Wang; Hesheng Wang; Danwei Wang; Weidong Chen

arXiv:2404.06050 · cs.CV · December 24, 2025 · 1 citation

Open Access

TL;DR

This paper introduces an incremental joint learning framework that jointly improves depth estimation, pose estimation, and scene reconstruction in large-scale scenes, using a vision transformer backbone and local radiance fields.

Contribution

The paper presents an integrated approach to large-scale scene reconstruction that combines transformer-based scale estimation, feature-metric bundle adjustment, and local scene representations.

Findings

01. Enhanced accuracy in depth and pose estimation.

02. Effective large-scale scene reconstruction with local radiance fields.

03. Robust performance demonstrated in extensive experiments.

Abstract

Dense scene reconstruction for photo-realistic view synthesis has various applications, such as VR/AR and autonomous vehicles. However, most existing methods struggle in large-scale scenes due to three core challenges: (a) Inaccurate depth input: accurate depth input is impossible to obtain in real-world large-scale scenes. (b) Inaccurate pose estimation: most existing approaches rely on accurate pre-estimated camera poses. (c) Insufficient scene representation capability: a single global radiance field lacks the capacity to scale effectively to large-scale scenes. To this end, we propose an incremental joint learning framework that achieves accurate depth estimation, accurate pose estimation, and large-scale scene reconstruction. A vision transformer-based network is adopted as the backbone to enhance performance in scale information estimation. For pose estimation, a…

Taxonomy

Topics: Advanced Vision and Imaging · Robotics and Sensor-Based Localization · Image Processing Techniques and Applications