Inferring Compositional 4D Scenes without Ever Seeing One

Ahmet Berke Gokmen; Ajad Chhatkuli; Luc Van Gool; Danda Pani Paudel

arXiv:2512.05272·cs.CV·March 27, 2026

TL;DR

COM4D is a novel method that reconstructs 4D scenes with multiple objects from monocular videos without needing 4D training data, by disentangling spatial and temporal learning.

Contribution

It introduces a training framework that learns object composition and dynamics separately, then combines them at inference without 4D supervision.

Findings

1. Achieves state-of-the-art results in 4D object reconstruction.
2. Reconstructs complete 4D scenes with multiple objects from monocular videos.
3. Does not require 4D compositional training data.

Abstract

Scenes in the real world are often composed of several static and dynamic objects. Capturing their 4-dimensional structures, composition and spatio-temporal configuration in-the-wild, though extremely interesting, is equally hard. Therefore, existing works often focus on one object at a time, while relying on some category-specific parametric shape model for dynamic objects. This can lead to inconsistent scene configurations, in addition to being limited to the modeled object categories. We propose COM4D (Compositional 4D), a method that consistently and jointly predicts the structure and spatio-temporal configuration of 4D/3D objects using only static multi-object or dynamic single object supervision. We achieve this by a carefully designed training of spatial and temporal attentions on 2D video input. The training is disentangled into learning from object compositions on the one hand,…
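The abstract refers to spatial and temporal attentions that are trained in a disentangled way and used together at inference. As a rough illustration of what such a factorization can look like, the sketch below applies attention across objects within a frame, then attention across frames within an object. It is not the paper's architecture: the tensor layout (T frames, K objects, N tokens per object, D channels), the function names, and the absence of learned projections are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head self-attention with no learned weights (illustration only).
    # tokens: (batch, length, dim)
    d = tokens.shape[-1]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def spatial_attention(x):
    # x: (T, K, N, D). Within each frame, the tokens of all K objects
    # attend to each other; frames do not interact.
    T, K, N, D = x.shape
    out = self_attention(x.reshape(T, K * N, D))
    return out.reshape(T, K, N, D)

def temporal_attention(x):
    # x: (T, K, N, D). For each object, its tokens attend across all T
    # frames; objects do not interact.
    T, K, N, D = x.shape
    per_object = x.transpose(1, 0, 2, 3).reshape(K, T * N, D)
    out = self_attention(per_object)
    return out.reshape(K, T, N, D).transpose(1, 0, 2, 3)

def combined_block(x):
    # Hypothetical inference-time combination: composition first, then dynamics.
    return temporal_attention(spatial_attention(x))
```

In this layout, the spatial pass alone could in principle be supervised with static multi-object data (T = 1) and the temporal pass alone with dynamic single-object data (K = 1), which is one way to read the supervision split the abstract describes. How COM4D actually shares or alternates these layers is not stated in the text above.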

Code & Models

Models

Hugging Face: INSAIT-Institute/COM4D

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Advanced Vision and Imaging