World Models That Know When They Don't Know - Controllable Video Generation with Calibrated Uncertainty
Zhiting Mei, Tenny Yin, Micah Baker, Ola Shorinwa, Anirudha Majumdar

TL;DR
This paper introduces C3, an uncertainty quantification (UQ) method for controllable video models that produces calibrated, spatially localized confidence estimates at the pixel level, so that unreliable regions of a generated video can be identified.
Contribution
The paper presents a new framework for training video models with calibrated uncertainty estimates in latent space, improving trustworthiness and out-of-distribution detection in controllable video generation.
Findings
Provides dense, pixel-level uncertainty heatmaps for generated videos.
Achieves calibrated uncertainty estimates within the training distribution.
Enables effective detection of out-of-distribution video frames.
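To make "calibrated" concrete: a confidence estimate is calibrated when, among pixels assigned confidence near p, roughly a fraction p turn out to be accurate. The sketch below computes a standard expected calibration error (ECE) over flattened per-pixel confidences. It is an illustration under assumptions, not the paper's evaluation code: the function name `expected_calibration_error`, the binary per-pixel "accurate" labels, and the equal-width binning are all hypothetical choices that do not come from the paper.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted mean of |accuracy - mean confidence| over equal-width bins.

    confidences: per-pixel confidence values in [0, 1] (assumed flattened).
    correct:     per-pixel booleans, True where the generated pixel is
                 judged accurate (how this is judged is an assumption).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one pixel")

    # Group pixels by confidence bin; a confidence of 1.0 goes in the last bin.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - mean_conf)
    return ece


# Calibrated case: confidence 0.8 everywhere, 8 of 10 pixels accurate.
calibrated = expected_calibration_error([0.8] * 10, [True] * 8 + [False] * 2)

# Overconfident case: confidence 0.9 everywhere, only 5 of 10 accurate.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
```

A value near zero means stated confidence matches observed accuracy. The overconfident case scores about 0.4, the gap between 0.9 confidence and 0.5 accuracy.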
Abstract
Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch…
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging
