DyST: Towards Dynamic Neural Scene Representations on Real-World Videos
Maximilian Seitzer, Sjoerd van Steenkiste, Thomas Kipf, Klaus Greff, Mehdi S. M. Sajjadi

TL;DR
DyST introduces a neural scene representation model that captures 3D structure and dynamics from monocular videos, enabling separate control over scene content, motion, and camera pose for improved visual understanding.
Contribution
The paper presents DyST, a novel neural scene representation model that decomposes monocular videos into content, dynamics, and pose using a co-training scheme and a new synthetic dataset.
Findings
DyST effectively disentangles scene content and motion.
It enables view synthesis with independent control over camera and scene.
The model demonstrates promising results on real-world videos.
Abstract
Visual understanding of the world goes beyond the semantics and flat structure of individual images. In this work, we aim to capture both the 3D structure and dynamics of real-world scenes from monocular real-world videos. Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose. This separation is achieved through a novel co-training scheme on monocular videos and our new synthetic dataset DySO. DyST learns tangible latent representations for dynamic scenes that enable view generation with separate control over the camera and the content of the scene.
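As an illustration only, the separation of camera pose and scene dynamics can be pictured as a view-pairing rule on synthetic data, assuming each scene is available under several cameras and several dynamics states. The sketch below is an assumption based on that description, not the authors' implementation; the function name `control_swap_sources` and the (camera, dynamics) index grid are hypothetical.

```python
import random


def control_swap_sources(target, n_cams, n_dyns, rng):
    """Pick the views that control latents are estimated from for one target view.

    target: (camera index, dynamics index) of the view to be rendered.
    Returns (camera_source, dynamics_source), each a (camera, dynamics) pair.
    Hypothetical helper for illustration; requires n_cams >= 2 and n_dyns >= 2.
    """
    cam, dyn = target
    # Camera latent source: same camera as the target, different scene dynamics.
    other_dyn = rng.choice([d for d in range(n_dyns) if d != dyn])
    # Dynamics latent source: same scene dynamics as the target, different camera.
    other_cam = rng.choice([c for c in range(n_cams) if c != cam])
    return (cam, other_dyn), (other_cam, dyn)


rng = random.Random(0)
cam_src, dyn_src = control_swap_sources((1, 2), n_cams=3, n_dyns=4, rng=rng)
print("camera latent from view:", cam_src)
print("dynamics latent from view:", dyn_src)
```

Under this pairing, neither source view alone matches the target in both factors, so a model could only reconstruct the target by reading the camera from one latent and the dynamics from the other.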
Peer Reviews
Decision: ICLR 2024 spotlight
Compared to the state of the art, this work investigates the more difficult setting of estimating moving objects and moving cameras from only a few motion pictures and a monocular camera. Moreover, the authors remove the assumption of training one model per scene. The separation of 3D structure estimation and camera motion is an interesting property of the model. The training tricks illustrated by Eq. 5 and Eq. 6 provide a practical way of enforcing this while still retaining the benefit of e
The amplitude of the motion probably limits the accuracy of the method. In Fig. 7 the motion is tiny, and this is not evaluated by the authors. Although the encoder and decoder architectures are rather small for the "simple" cases covered by the paper, I have concerns about the scalability of this method to more realistic cases and more complex motions.
- The authors co-train on synthetic and real-world datasets to transfer the dynamics and camera control potential of synthetic scenes to natural monocular video, and the results shown in Fig. 5 indicate that the model has learned to encode dynamics independently of the camera pose.
- Since there is no architectural difference between camera pose and scene dynamics, the authors propose to enforce separation through a novel latent control swap training scheme, and the results in Fig. 3 demonstrate the
- [Generalization in different types of motions] Additional experiments are needed to see if the proposed DyST model can generalize to camera poses and scene dynamics that were not seen during training. It would be better to provide qualitative results showing what the controlled view looks like when horizontal shifts are input after training without horizontal shifts. (DySO's camera motions consist of 4 horizontal shifts, panning, zooming motions, and random camera points.)
- [Cluttered background
1. This paper is well-motivated. The primary goal of this paper is the separation of scene dynamics and camera pose, while most existing works only cover static scenes.
2. The authors propose a novel training scheme that disentangles the camera pose using two views taken from the same camera while the scene contains a moving object, and disentangles the scene dynamics using two views with still objects taken from two different cameras. To fulfill this training strategy, the authors also establish a new
1. The method is quite similar to RUST [1]. The encoder, decoder, and camera estimator are almost the same as the ones proposed in RUST.
2. Inference procedure. From the method architecture, the target view is required to obtain the camera latents and dynamics latents. In this case, I wonder whether the specific novel view image is needed as input to generate the novel view.
3. Control the latent code. In Fig. 7, the authors show the results of controlling the camera latent and the dynamics latent. The aut
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Softmax · Byte Pair Encoding · Label Smoothing · Adam · Absolute Position Encodings · Residual Connection
