TL;DR
MonoScene is a monocular 3D semantic scene completion (SSC) framework that infers dense 3D geometry and semantics from a single RGB image. The authors report that it outperforms existing methods and hallucinates plausible scenery beyond the camera field of view.
Contribution
The paper presents a new monocular SSC framework built from successive 2D and 3D UNets, bridged by a 2D-3D feature projection, together with a 3D context relation prior and novel global scene and local frustum losses, advancing the state of the art in 3D scene understanding from monocular images.
Findings
Outperforms existing methods on all metrics and datasets.
Successfully hallucinates plausible scene parts beyond the camera field of view.
Introduces a novel 2D-3D feature projection and a 3D context relation prior that enforces spatio-semantic consistency.
Abstract
MonoScene proposes a 3D Semantic Scene Completion (SSC) framework, where the dense geometry and semantics of a scene are inferred from a single monocular RGB image. Unlike the SSC literature, which relies on 2.5D or 3D input, we solve the complex problem of 2D to 3D scene reconstruction while jointly inferring its semantics. Our framework relies on successive 2D and 3D UNets bridged by a novel 2D-3D feature projection inspired by optics, and introduces a 3D context relation prior to enforce spatio-semantic consistency. Along with architectural contributions, we introduce novel global scene and local frustum losses. Experiments show we outperform the literature on all metrics and datasets while hallucinating plausible scenery even beyond the camera field of view. Our code and trained models are available at https://github.com/cv-rits/MonoScene.
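The abstract does not spell out how the 2D-3D feature projection works, so the following is only an illustrative sketch of one plausible reading of an optics-inspired projection: each voxel center is projected into the image with a pinhole camera model, and the voxel takes the 2D feature found at that pixel. It is not the authors' implementation. The function names, the nearest-pixel sampling, the zero-filling of voxels outside the image, and the intrinsics parameters (fx, fy, cx, cy) are all assumptions made for this example.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_to_pixel(
    center: Vec3, fx: float, fy: float, cx: float, cy: float
) -> Optional[Tuple[int, int]]:
    """Pinhole projection of a 3D point (camera frame, z forward) to a pixel.

    Hypothetical helper: returns None for points at or behind the camera plane.
    """
    x, y, z = center
    if z <= 0.0:
        return None
    u = math.floor(fx * x / z + cx)
    v = math.floor(fy * y / z + cy)
    return u, v


def lift_features(
    feat_2d: Sequence[Sequence[Sequence[float]]],
    centers: Sequence[Vec3],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[List[float]]:
    """Assign each voxel the 2D feature lying on its line of sight.

    feat_2d[v][u] is the feature vector at image row v, column u.
    Assumption for this sketch: voxels that project outside the image
    (or lie behind the camera) receive an all-zero feature vector.
    """
    height, width = len(feat_2d), len(feat_2d[0])
    channels = len(feat_2d[0][0])
    lifted: List[List[float]] = []
    for center in centers:
        pix = project_to_pixel(center, fx, fy, cx, cy)
        if pix is None or not (0 <= pix[0] < width and 0 <= pix[1] < height):
            lifted.append([0.0] * channels)
        else:
            u, v = pix
            lifted.append(list(feat_2d[v][u]))
    return lifted


# Toy 2x2 feature map with one channel, and made-up intrinsics.
toy_features = [[[1.0], [2.0]], [[3.0], [4.0]]]
toy_centers = [
    (0.0, 0.0, 1.0),    # on the optical axis
    (0.0, 0.0, 2.0),    # same line of sight, farther away
    (-1.0, -1.0, 1.0),  # projects to the top-left pixel
    (5.0, 0.0, 1.0),    # projects outside the image
    (0.0, 0.0, -1.0),   # behind the camera
]
toy_lifted = lift_features(toy_features, toy_centers, fx=1.0, fy=1.0, cx=1.0, cy=1.0)
```

In this toy setup the first two voxels share a line of sight and therefore receive the same 2D feature, which illustrates why a 3D network is still needed afterwards to disambiguate depth along each ray.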
