Counterfactual World Models via Digital Twin-conditioned Video Diffusion
Yiqing Shen, Aiza Maksutova, Chenjia Li, Mathias Unberath

TL;DR
This paper introduces CWMDT, a framework that creates digital twins of scenes and uses language models to enable counterfactual reasoning in video diffusion models, allowing for hypothetical scene modifications and predictions.
Contribution
The paper presents a novel method to incorporate explicit scene representations and language reasoning into video diffusion models for counterfactual world modeling.
Findings
Achieves state-of-the-art performance on benchmark tasks.
Enables targeted interventions on scene properties.
Demonstrates effective counterfactual scene prediction.
Abstract
World models learn to predict the temporal evolution of visual observations given a control signal, potentially enabling agents to reason about environments through forward simulation. Because of the focus on forward simulation, current world models generate predictions based on factual observations. For many emerging applications, such as comprehensive evaluations of physical AI behavior under varying conditions, the ability of world models to answer counterfactual queries, such as "what would happen if this object was removed?", is of increasing importance. We formalize counterfactual world models that additionally take interventions as explicit inputs, predicting temporal sequences under hypothetical modifications to observed scene properties. Traditional world models operate directly on entangled pixel-space representations where object properties and relationships cannot be…
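As a rough illustration of the counterfactual query quoted above ("what would happen if this object was removed?"), the sketch below applies an explicit intervention to a structured scene description before any prediction step. It is not the authors' implementation. The names `ObjectState`, `Intervention`, and `apply_intervention`, and the dictionary-based scene format, are hypothetical and chosen only for this example. The video generation step is omitted entirely.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional, Tuple

# Hypothetical structured scene ("digital twin"): object id -> object state.
# The field names are assumptions made for illustration only.
@dataclass(frozen=True)
class ObjectState:
    category: str
    position: Tuple[float, float]

Scene = Dict[str, ObjectState]

@dataclass(frozen=True)
class Intervention:
    kind: str                                  # "remove" or "move"
    target: str                                # id of the object to modify
    new_position: Optional[Tuple[float, float]] = None

def apply_intervention(scene: Scene, iv: Intervention) -> Scene:
    """Return an edited copy of the scene; the factual scene is left unchanged."""
    if iv.target not in scene:
        raise KeyError(f"unknown object: {iv.target}")
    edited = dict(scene)
    if iv.kind == "remove":
        del edited[iv.target]
    elif iv.kind == "move":
        if iv.new_position is None:
            raise ValueError("a 'move' intervention needs new_position")
        edited[iv.target] = replace(scene[iv.target], position=iv.new_position)
    else:
        raise ValueError(f"unsupported intervention kind: {iv.kind}")
    return edited

# Factual scene as observed.
factual: Scene = {
    "cup": ObjectState("cup", (0.2, 0.5)),
    "arm": ObjectState("robot_arm", (0.8, 0.5)),
}

# Counterfactual scenes: the same observation under hypothetical edits.
without_cup = apply_intervention(factual, Intervention("remove", "cup"))
moved_cup = apply_intervention(factual, Intervention("move", "cup", (0.4, 0.1)))
```

In a sketch like this, the edited scene rather than the raw pixels would be what conditions a downstream video generator, which is why the edit can target a single object without touching the rest of the scene.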
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
