Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video
Zihui Gao, Ke Liu, Donny Y. Chen, Duochao Shi, Guosheng Lin, Hao Chen, Chunhua Shen

TL;DR
This paper introduces SAGE, a novel framework that leverages Internet videos with weak supervision to scale and improve 3D geometric foundation models, significantly enhancing their generalization capabilities.
Contribution
SAGE is the first scalable method to adapt 3D geometric models from raw internet videos using hierarchical mining and hybrid supervision techniques.
Findings
Reduces Chamfer Distance by 20-42% on benchmarks
Improves zero-shot generalization of 3D models
Establishes a scalable paradigm for 3D learning from videos
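For readers unfamiliar with the metric in the first finding, here is a minimal sketch of a symmetric Chamfer Distance between two point sets, plus the relative-reduction arithmetic behind a figure such as "20-42%". The summary does not say which Chamfer variant the paper uses (squared or unsquared, summed or averaged), so the unsquared, mean-averaged form below is an assumption. The function names and sample points are hypothetical and not taken from the paper.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between two non-empty 3D point sets.

    Assumed variant: mean unsquared Euclidean nearest-neighbour distance,
    summed over both directions. Brute force, O(len(a) * len(b)).
    """
    if not points_a or not points_b:
        raise ValueError("both point sets must be non-empty")

    def mean_nearest(src, dst):
        # For each point in src, take the distance to its closest point in dst,
        # then average over src.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return mean_nearest(points_a, points_b) + mean_nearest(points_b, points_a)


def relative_reduction(baseline, adapted):
    """Fractional drop of `adapted` relative to `baseline` (0.2 means 20%)."""
    return (baseline - adapted) / baseline


# Illustrative points only; these are not results from the paper.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))
```

A lower value means the predicted and reference point clouds lie closer together, so a reduction in Chamfer Distance corresponds to more accurate reconstructed geometry under this metric.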
Abstract
Geometric foundation models show promise in 3D reconstruction, yet their progress is severely constrained by the scarcity of diverse, large-scale 3D annotations. While Internet videos offer virtually unlimited raw data, utilizing them as a scaling source for geometric learning is challenging due to the absence of ground-truth geometry and the presence of observational noise. To address this, we propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. SAGE leverages a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision: (1) Informative training trajectory selection; (2) Sparse Geometric Anchoring via SfM point clouds for global structural guidance; and (3) Dense Differentiable Consistency via 3D Gaussian rendering for multi-view constraints. To prevent catastrophic forgetting, we introduce a…
Taxonomy
Topics: 3D Shape Modeling and Analysis · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
