Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

Chuhan Zhang; Guillaume Le Moing; Skanda Koppula; Ignacio Rocco; Liliane Momeni; Junyu Xie; Shuyang Sun; Rahul Sukthankar; Joëlle K. Barral; Raia Hadsell; Zoubin Ghahramani; Andrew Zisserman; Junlin Zhang; Mehdi S. M. Sajjadi

arXiv:2512.08924 · cs.CV · December 11, 2025

TL;DR

This paper presents D4RT, a transformer-based model that efficiently reconstructs dynamic scenes from video by jointly estimating depth, motion, and camera parameters, achieving state-of-the-art results with a scalable and lightweight approach.

Contribution

Introducing D4RT, a unified transformer architecture with a novel querying mechanism for efficient 4D scene reconstruction from video.

Findings

01. Outperforms previous methods across multiple 4D reconstruction tasks.

02. Enables efficient training and inference due to its lightweight design.

03. Achieves state-of-the-art accuracy in dynamic scene reconstruction.

Abstract

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art,…
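The querying idea in the abstract, probing the 3D position of any point in space and time independently of all other points, can be illustrated with a small sketch. This is not the authors' implementation: the query layout `(u, v, t_src, t_tgt)`, the single cross-attention layer, the function names, and the random weights are all assumptions made for illustration. The only property it aims to show is that each query reads from a shared scene representation without interacting with other queries, so decoding a batch gives the same answer as decoding each query alone.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def init_params(rng, dim, query_dim=4):
    """Random stand-in weights (hypothetical); a real model would learn these."""
    scale = 1.0 / np.sqrt(dim)
    return {
        "w_embed": rng.normal(size=(query_dim, dim)) * scale,  # query -> feature
        "w_q": rng.normal(size=(dim, dim)) * scale,
        "w_k": rng.normal(size=(dim, dim)) * scale,
        "w_v": rng.normal(size=(dim, dim)) * scale,
        "w_out": rng.normal(size=(dim, 3)) * scale,  # feature -> (x, y, z)
    }


def decode_points(scene_tokens, queries, params):
    """Cross-attend each query to the shared scene tokens and emit a 3D point.

    scene_tokens: (N, D) features, assumed to come from a video encoder.
    queries:      (Q, 4) rows of (u, v, t_src, t_tgt), an assumed layout.
    Returns:      (Q, 3) array, one 3D point per query.
    """
    q = (queries @ params["w_embed"]) @ params["w_q"]      # (Q, D)
    k = scene_tokens @ params["w_k"]                       # (N, D)
    v = scene_tokens @ params["w_v"]                       # (N, D)
    # Each row of attn depends only on its own query, so queries never mix.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Q, N)
    return (attn @ v) @ params["w_out"]                    # (Q, 3)


rng = np.random.default_rng(0)
dim = 16
params = init_params(rng, dim)
scene_tokens = rng.normal(size=(32, dim))  # placeholder for encoded video

# Three probes: the same pixel at two target times, plus a different pixel.
queries = np.array([
    [0.25, 0.50, 0.0, 0.0],
    [0.25, 0.50, 0.0, 1.0],
    [0.80, 0.10, 0.5, 0.5],
])

points = decode_points(scene_tokens, queries, params)
one_by_one = np.vstack(
    [decode_points(scene_tokens, row[None, :], params) for row in queries]
)
print(points.shape, np.allclose(points, one_by_one))
```

Because the queries are independent in this sketch, the number of probes can be chosen freely at decode time, from a handful of tracked points to a dense grid, without re-encoding the video. Whether D4RT's decoder has this exact form is not stated in the abstract.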

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Robotics and Sensor-Based Localization