YouTube-Occ: Learning Indoor 3D Semantic Occupancy Prediction from YouTube Videos
Haoming Chen, Lichen Yuan, TianFang Sun, Jingyu Gong, Xin Tan, Zhizhong Zhang, Yuan Xie

TL;DR
This paper introduces YouTube-Occ, a web dataset of YouTube house tour videos, together with a fully self-supervised model trained on it that learns indoor 3D semantic occupancy prediction without prior knowledge of intrinsic or extrinsic camera parameters, achieving state-of-the-art zero-shot results.
Contribution
It presents a novel approach leveraging web-sourced indoor videos and vision foundation models for 3D perception without geometric annotations.
Findings
Achieves state-of-the-art zero-shot performance on NYUv2 and OccScanNet.
Demonstrates effective 3D indoor understanding using only internet videos.
Requires no prior knowledge of intrinsic or extrinsic camera parameters for training.
Abstract
3D semantic occupancy prediction has traditionally been considered to require precise geometric relationships for effective training. However, in complex indoor environments, large-scale data collection and fine-grained annotation become impractical because of the complexity of data acquisition setups and privacy concerns. In this paper, we demonstrate that 3D spatially-accurate training can be achieved using only indoor Internet data, without any prior knowledge of intrinsic or extrinsic camera parameters. In our framework, we collect a web dataset, YouTube-Occ, which comprises house tour videos from YouTube, providing abundant real house scenes for 3D representation learning. Building on this web dataset, we establish a fully self-supervised model to leverage accessible 2D prior knowledge for reaching powerful 3D indoor…
