3D sans 3D Scans: Scalable Pre-training from Video-Generated Point Clouds
Ryousuke Yamada, Kohsuke Ide, Yoshihiro Fukuhara, Hirokatsu Kataoka, Gilles Puy, Andrei Bursuc, Yuki M. Asano

TL;DR
This paper introduces a self-supervised learning framework that reconstructs 3D point clouds from unlabeled videos and pre-trains on them, outperforming previous methods on indoor segmentation without using any real 3D scans.
Contribution
The work presents LAM3C, a novel self-supervised method utilizing video-generated point clouds and a noise-regularized loss for scalable 3D representation learning from unlabeled videos.
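This summary does not give the exact form of the noise-regularized loss; the abstract only says that it enforces local geometric smoothness and feature stability under noisy point clouds. As a rough illustration of what a local smoothness penalty can look like, the sketch below penalizes each point's feature for deviating from the mean feature of its k nearest neighbors, a generic graph-Laplacian-style term. The function names, the choice of k, and the brute-force neighbor search are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def knn_indices(points, k):
    """Return the indices of the k nearest neighbors of each point (brute force)."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3)
    d2 = np.sum(diff * diff, axis=-1)                # squared distances, (N, N)
    np.fill_diagonal(d2, np.inf)                     # exclude the point itself
    return np.argsort(d2, axis=1)[:, :k]             # (N, k)


def laplacian_smoothness(points, feats, k=8):
    """Mean squared gap between each feature and its neighbors' mean feature.

    Hypothetical stand-in for a local-smoothness regularizer; the paper's
    actual loss is not specified in this summary.
    """
    idx = knn_indices(points, k)
    neighbor_mean = feats[idx].mean(axis=1)          # (N, D)
    residual = feats - neighbor_mean
    return float(np.mean(np.sum(residual ** 2, axis=1)))


rng = np.random.default_rng(0)
pts = rng.normal(size=(64, 3))
smooth_feats = np.ones((64, 4))          # identical features everywhere
rough_feats = rng.normal(size=(64, 4))   # unrelated features per point

penalty_smooth = laplacian_smoothness(pts, smooth_feats, k=4)
penalty_rough = laplacian_smoothness(pts, rough_feats, k=4)
```

Features that are constant across a neighborhood incur no penalty, while features that vary arbitrarily between neighboring points are penalized, which is the behavior a smoothness term is meant to encourage on noisy reconstructions.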
Findings
Outperforms previous methods on indoor segmentation tasks
Uses only video-generated point clouds without real 3D scans
Introduces RoomTours dataset with 49,219 scenes
Abstract
Despite recent progress in 3D self-supervised learning, collecting large-scale 3D scene scans remains expensive and labor-intensive. In this work, we investigate whether 3D representations can be learned from unlabeled videos recorded without any real 3D sensors. We present Laplacian-Aware Multi-level 3D Clustering with Sinkhorn-Knopp (LAM3C), a self-supervised framework that learns from video-generated point clouds reconstructed from unlabeled videos. We first introduce RoomTours, a video-generated point cloud dataset constructed by collecting room-walkthrough videos from the web (e.g., real-estate tours) and generating 49,219 scenes using an off-the-shelf feed-forward reconstruction model. We also propose a noise-regularized loss that stabilizes representation learning by enforcing local geometric smoothness and ensuring feature stability under noisy point clouds. Remarkably, without…
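The method name refers to Sinkhorn-Knopp, an iterative normalization commonly used in clustering-based self-supervised learning to turn point-to-prototype scores into soft cluster assignments that are spread roughly evenly across clusters. The following is a minimal sketch of that generic procedure, not the paper's implementation; the temperature `epsilon`, the iteration count, and the shape of `scores` are assumptions chosen for illustration.

```python
import numpy as np


def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """Balanced soft assignments from an (N, K) score matrix.

    Alternates column and row normalization so that clusters receive
    roughly equal total mass and each row ends up as a distribution.
    """
    # Subtract the max before exponentiating for numerical stability.
    q = np.exp((scores - scores.max()) / epsilon)
    q /= q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)   # each cluster column sums to 1
        q /= k                              # ... then carries 1/K of the mass
        q /= q.sum(axis=1, keepdims=True)   # each point row sums to 1
        q /= n                              # ... then carries 1/N of the mass
    return q * n                            # rows sum to 1 again


rng = np.random.default_rng(0)
scores = rng.uniform(-1.0, 1.0, size=(32, 4))   # e.g. cosine similarities
assignments = sinkhorn_knopp(scores)
```

Each row of `assignments` is a probability distribution over clusters. The column normalization discourages the degenerate solution in which every point is assigned to the same cluster.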
Taxonomy
Topics: 3D Shape Modeling and Analysis · Robotics and Sensor-Based Localization · Human Pose and Action Recognition
