Scaling 4D Representations

João Carreira; Dilara Gokay; Michael King; Chuhan Zhang; Ignacio Rocco; Aravindh Mahendran; Thomas Albert Keck; Joseph Heyward; Skanda Koppula; Etienne Pot; Goker Erdogan; Yana Hasson; Yi Yang; Klaus Greff; Guillaume Le Moing; Sjoerd van Steenkiste; Daniel Zoran; Drew A. Hudson; Pedro Vélez; Luisa Polanía; Luke Friedman; Chris Duvarney; Ross Goroshin; Kelsey Allen; Jacob Walker; Rishabh Kabra; Eric Aboussouan; Jennifer Sun; Thomas Kipf; Carl Doersch; Viorica Pătrăucean; Dima Damen; Pauline Luc; Mehdi S. M. Sajjadi; Andrew Zisserman

arXiv:2412.15212 · cs.CV · July 10, 2025



Open Access

TL;DR

This paper demonstrates that self-supervised masked auto-encoding (MAE) on very large video datasets scales: performance on non-semantic 4D vision tasks such as camera pose estimation, tracking, and depth estimation improves consistently as model size grows from 20M to 22B parameters.

Contribution

It shows that scaling transformer-based video models trained with self-supervised learning improves performance on 4D (spatial plus temporal) tasks, extending the benefits of scaling beyond semantic tasks.

Findings

01. Performance improves with model size from 20M to 22B parameters.

02. Scaling benefits are consistent across various 4D tasks.

03. Large models outperform previous state-of-the-art on non-semantic video tasks.

Abstract

Scaling has not yet been convincingly demonstrated for pure self-supervised learning from video. However, prior work has focused evaluations on semantic-related tasks – action classification, ImageNet classification, etc. In this paper we focus on evaluating self-supervised learning on non-semantic vision tasks that are more spatial (3D) and temporal (+1D = 4D), such as camera pose estimation, point and object tracking, and depth estimation. We show that by learning from very large video datasets, masked auto-encoding (MAE) with transformer video models actually scales, consistently improving performance on these 4D tasks, as model size increases from 20M all the way to the largest by far reported self-supervised video model – 22B parameters. Rigorous apples-to-apples comparison with many recent image and video models demonstrates the benefits of scaling…
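For readers unfamiliar with masked auto-encoding on video, the following is a minimal sketch of the general idea only, not the authors' implementation. It assumes a clip is split into non-overlapping spatio-temporal patches (tokens), a random subset is hidden from the encoder, and the reconstruction loss is averaged over the hidden tokens. The function names, the patch size, and the 0.9 mask ratio are illustrative assumptions and do not come from the paper.

```python
import random


def num_tokens(frames, height, width, patch_t, patch_h, patch_w):
    """Count non-overlapping spatio-temporal patches (tokens) in a clip."""
    if frames % patch_t or height % patch_h or width % patch_w:
        raise ValueError("clip dimensions must be divisible by the patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


def random_mask(n_tokens, mask_ratio, seed=None):
    """Randomly split token indices into (visible, masked) lists."""
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must lie in [0, 1]")
    rng = random.Random(seed)
    order = list(range(n_tokens))
    rng.shuffle(order)
    n_masked = int(round(n_tokens * mask_ratio))
    masked = sorted(order[:n_masked])
    visible = sorted(order[n_masked:])
    return visible, masked


def masked_mse(pred, target, masked):
    """Mean squared error computed over the masked tokens only."""
    if not masked:
        raise ValueError("at least one token must be masked")
    total = 0.0
    count = 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count


# Hypothetical clip: 16 frames of 224x224 pixels, 2x16x16 patches.
n = num_tokens(16, 224, 224, 2, 16, 16)  # 8 * 14 * 14 = 1568 tokens
visible, masked = random_mask(n, mask_ratio=0.9, seed=0)
```

In this sketch only the `visible` tokens would be passed to the encoder, and a decoder would be trained to reconstruct the patches at the `masked` indices.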


Taxonomy

Topics: 3D Modeling in Geospatial Applications · Modular Robots and Swarm Intelligence

Methods: Focus