Dreamweaver: Learning Compositional World Models from Pixels
Junyeob Baek, Yi-Fu Wu, Gautam Singh, Sungjin Ahn

TL;DR
Dreamweaver is a neural architecture that learns hierarchical, compositional representations from raw videos. It generates novel, recombined future scenes without auxiliary data such as text, masks, or bounding boxes.
Contribution
It proposes the Recurrent Block-Slot Unit (RBSU), a novel module, together with a multi-future-frame prediction objective, for unsupervised learning of compositional video representations.
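One way to picture a multi-future-frame prediction objective is as a loss averaged over several predicted future frames rather than a reconstruction of the current one. The sketch below is only an illustration under stated assumptions: it assumes frames are represented as discrete tokens (the reviews mention a dVAE) and uses token-level cross-entropy, which the summary does not confirm. The function names and the toy vocabulary are hypothetical.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token id under a categorical distribution."""
    return -math.log(probs[target])

def multi_future_frame_loss(predicted, targets):
    """Average token-level cross-entropy over K predicted future frames.

    predicted: list of K frames; each frame is a list of per-token probability
               lists over a (hypothetical) discrete vocabulary.
    targets:   list of K frames; each frame is a list of true token ids.
    """
    total, count = 0.0, 0
    for frame_probs, frame_tokens in zip(predicted, targets):
        for probs, token in zip(frame_probs, frame_tokens):
            total += cross_entropy(probs, token)
            count += 1
    return total / count

# Toy example (assumed sizes): K = 2 future frames, 2 tokens per frame, vocabulary of 2.
predicted = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5]],
]
targets = [[0, 1], [1, 0]]
loss = multi_future_frame_loss(predicted, targets)
print(round(loss, 4))
```

With uniform predictions over two tokens, every term equals log 2, so the average does too. A model that predicts the future tokens more confidently and correctly drives the loss toward zero.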
Findings
Outperforms state-of-the-art baselines on world modeling benchmarks.
Enables recombination of learned attributes for novel video generation.
Demonstrates effective disentanglement of static and dynamic concepts.
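The recombination finding can be illustrated with a toy data structure. This is a minimal sketch, assuming each object is stored as one slot made of named attribute blocks. The block names (`shape`, `color`, `motion`), the vector values, and the `swap_block` helper are hypothetical and chosen only for illustration; in the actual model the blocks are learned without labels.

```python
from copy import deepcopy

# Hypothetical block layout: each slot (one object) holds one vector per attribute.
BLOCK_NAMES = ("shape", "color", "motion")

def swap_block(slots, src, dst, block):
    """Return a copy of `slots` where object `dst` takes the `block` of object `src`."""
    if block not in BLOCK_NAMES:
        raise ValueError(f"unknown block: {block}")
    new_slots = deepcopy(slots)
    new_slots[dst][block] = list(slots[src][block])
    return new_slots

# Two objects with made-up block vectors.
slots = [
    {"shape": [1.0, 0.0], "color": [0.2, 0.9], "motion": [0.0, 1.0]},
    {"shape": [0.0, 1.0], "color": [0.7, 0.1], "motion": [1.0, 0.0]},
]

# Give object 1 the motion pattern of object 0 while keeping its shape and color.
recombined = swap_block(slots, src=0, dst=1, block="motion")
```

Decoding the recombined slots would then, in principle, yield a video in which the second object keeps its appearance but moves like the first. The decoding step is outside the scope of this sketch.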
Abstract
Humans have an innate ability to decompose their perceptions of the world into objects and their attributes, such as colors, shapes, and movement patterns. This cognitive process enables us to imagine novel futures by recombining familiar concepts. However, replicating this ability in artificial intelligence systems has proven challenging, particularly when it comes to modeling videos into compositional concepts and generating unseen, recomposed futures without relying on auxiliary data, such as text, masks, or bounding boxes. In this paper, we propose Dreamweaver, a neural architecture designed to discover hierarchical and compositional representations from raw videos and generate compositional future simulations. Our approach leverages a novel Recurrent Block-Slot Unit (RBSU) to decompose videos into their constituent objects and attributes. In addition, Dreamweaver uses a…
Peer Reviews
Decision: ICLR 2025 Poster
1. Developed a new module, the Recurrent Block-Slot Unit (RBSU), to decompose videos. 2. Well-written and easy to follow. 3. Experimental results show that Dreamweaver can learn different attributes and freely combine them to generate varied videos.
1. Missing related works: Some related works [1,2,3] also discuss how to use RNNs for compositional video generation or how to use slot attention to learn disentangled representations. The authors could also include these in the related work section. 2. Insufficient comparison: This paper claims to be the first work that can learn both static and dynamic composable concepts in an unsupervised way. But I think SlotFormer and SlotDiffusion [3] can do the same decomposition and are not included for comparison
1. Dreamweaver learns compositional world representations without relying on auxiliary data such as text or labeled masks. 2. The proposed RBSU captures both static factors (such as shape) and dynamic factors (such as motion direction), allowing the model to generate new video sequences by recombining learned object attributes. 3. Dreamweaver performs well in new object configurations and arrangements outside the training set, demonstrating strong adaptability. 4. By predicting future frames,
1. The architecture of Dreamweaver relies on complex Recurrent Block Slot Units (RBSUs) and autoregressive Transformer decoders, requiring significant computational resources and memory, especially when processing long video sequences or higher resolution videos. 2. Due to the use of Discrete VAE (dVAE) for image token representation, Dreamweaver may be limited in video generation quality, particularly in applications that require fine visual details. For example, in the Moving-Sprites experim
1. The authors are the first to introduce a method for learning dynamic composable concepts from videos in an unsupervised way on top of static composable concepts while maintaining disentanglement. 2. The authors introduce a novel module, the Recurrent Block Slot Unit, to model dynamic concepts. 3. Instead of the traditional reconstruction objective, the authors use a predictive objective to model dynamic concepts better. 4. The authors demonstrate the effectiveness of their method on their datasets a
1. The compositional imagination evaluation only has qualitative results, which, while interesting, are not very informative about the model's performance relative to the other baselines. Some comparative, quantitative results should help here. For example, the authors can hold out a set of combinations in their dataset during training and evaluate the fidelity and consistency of the imagined results for these unseen combinations using standard generation quality evaluation metrics like FVD (Cobbe et
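The held-out evaluation suggested in this review could be set up along the following lines. This is a rough sketch, not the authors' protocol: the attribute names and values, the `holdout_every` parameter, and both helper functions are assumptions made for illustration. The idea is to withhold some attribute combinations from training while making sure every individual attribute value still appears in the training set.

```python
from itertools import product

def split_combinations(attributes, holdout_every=4):
    """Split all attribute combinations into training and held-out sets.

    Every `holdout_every`-th combination (in sorted order) is withheld.
    """
    combos = sorted(product(*attributes.values()))
    train, held_out = [], []
    for i, combo in enumerate(combos):
        (held_out if i % holdout_every == 0 else train).append(combo)
    return train, held_out

def covers_all_values(train, attributes):
    """Check that each attribute value still occurs in some training combination."""
    for position, values in enumerate(attributes.values()):
        seen = {combo[position] for combo in train}
        if seen != set(values):
            return False
    return True

# Hypothetical attribute grid: 3 shapes x 3 motion directions = 9 combinations.
attributes = {
    "shape": ["circle", "square", "triangle"],
    "motion": ["down", "left", "up"],
}
train, held_out = split_combinations(attributes)
print(held_out)
print(covers_all_values(train, attributes))
```

A generation quality metric would then be computed only on videos imagined for the held-out combinations, so the score reflects recombination rather than memorization.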
Taxonomy
Topics: Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
