SS3D: End2End Self-Supervised 3D from Web Videos

Marwane Hariat; Gianni Franchi; David Filliat; Antoine Manzanera

arXiv:2604.22686 · cs.CV · May 14, 2026

TL;DR

SS3D is a self-supervised pretraining pipeline for 3D estimation from web videos. A single end-to-end model jointly predicts depth, ego-motion, and camera intrinsics, and pretraining yields strong cross-domain zero-shot transfer and improved fine-tuning results over prior self-supervised baselines.

Contribution

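The paper presents a web-scale self-supervised 3D pretraining method that combines SfM-based self-supervision, a multi-view signal proxy (MVS) for filtering and curriculum sampling, and expert training distilled into a single student. The resulting end-to-end model is pretrained on YouTube-8M.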

Findings

01

Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer.

02

The joint model predicts depth, ego-motion, and intrinsics in a single pass.

03

The approach outperforms prior self-supervised baselines in fine-tuning.
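
To illustrate how the three quantities in finding 02 fit together, the sketch below shows the standard view-synthesis geometry that SfM-style self-supervision typically relies on: a pixel is back-projected with its depth and the inverse intrinsics, moved by the relative camera pose, and projected again. This is a generic illustration, not the paper's implementation; the function name `reproject`, the pinhole camera model, and the use of NumPy are assumptions made here.

```python
import numpy as np

def reproject(depth, K, T):
    """Map every pixel of a target frame into a source frame.

    Hypothetical helper for illustration (pinhole model assumed).
    depth: (H, W) depth of the target frame
    K:     (3, 3) camera intrinsics with last row [0, 0, 1]
    T:     (4, 4) rigid transform from target camera to source camera
    Returns (H, W, 2) pixel coordinates (u, v) in the source frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                   # unit-depth rays
    pts = rays * depth[..., None]                     # 3D points, target camera
    pts_src = pts @ T[:3, :3].T + T[:3, 3]            # 3D points, source camera
    proj = pts_src @ K.T                              # project with the same K
    return proj[..., :2] / proj[..., 2:3]             # perspective divide
```

With an identity pose the function returns the original pixel grid. With a pure sideways translation `tx` and constant depth `d`, every pixel shifts horizontally by `fx * tx / d`, which is why depth, ego-motion, and intrinsics have to be mutually consistent for a photometric comparison between the two frames to make sense.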

Abstract

We present SS3D, a web-scale SfM-based self-supervision pretraining pipeline for feed-forward 3D estimation from monocular video. Our model jointly predicts depth, ego-motion, and intrinsics in a single forward pass and is trained/evaluated as a coherent end-to-end 3D estimator. To stabilize joint learning, we use an intrinsics-first two-stage schedule and a unified single-checkpoint evaluation protocol. Scaling SfM self-supervision to unconstrained web video is challenging due to weak multi-view observability and strong corpus heterogeneity; we address these with a multi-view signal proxy (MVS) used for filtering and curriculum sampling, and with expert training distilled into a single student. Pretraining on YouTube-8M (~100M frames after filtering) yields strong cross-domain zero-shot transfer and improved fine-tuning performance over prior self-supervised baselines. We release the…
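
The abstract says the multi-view signal proxy is used for filtering and curriculum sampling but does not define it here. The sketch below assumes, purely for illustration, that each clip has a scalar proxy score in [0, 1], and shows one simple way such a score could drive both steps. The names `filter_clips` and `curriculum_weights`, the threshold, the `sharpness` parameter, and the softmax schedule are all hypothetical and not taken from the paper.

```python
import numpy as np

def filter_clips(scores, threshold):
    """Return indices of clips whose (assumed) proxy score >= threshold."""
    return np.flatnonzero(np.asarray(scores) >= threshold)

def curriculum_weights(scores, progress, sharpness=5.0):
    """Sampling probabilities that favour high-score clips early on.

    progress in [0, 1]: 0 is the start of training, 1 is the end,
    where sampling becomes uniform. The schedule is an assumption.
    """
    scores = np.asarray(scores, dtype=np.float64)
    logits = sharpness * (1.0 - progress) * scores
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return w / w.sum()

# Toy usage with made-up scores for four clips.
scores = np.array([0.05, 0.4, 0.9, 0.7])
kept = filter_clips(scores, threshold=0.3)
early = curriculum_weights(scores[kept], progress=0.0)
late = curriculum_weights(scores[kept], progress=1.0)
rng = np.random.default_rng(0)
batch = rng.choice(kept, size=2, p=early)  # clip indices for one batch
```

In this toy setup, filtering discards clips with little multi-view signal outright, while the curriculum starts on the clips with the strongest signal and gradually flattens toward uniform sampling over the retained set.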
