World Models That Know When They Don't Know: Controllable Video Generation with Calibrated Uncertainty

Zhiting Mei; Tenny Yin; Micah Baker; Ola Shorinwa; Anirudha Majumdar

arXiv:2512.05927 · cs.CV · March 12, 2026

Open Access

TL;DR

This paper introduces C3, a novel uncertainty quantification method for controllable video models that localizes and calibrates confidence estimates at the pixel level, enhancing reliability in video generation tasks.

Contribution

The paper presents a new framework for training video models with calibrated uncertainty estimates in latent space, improving trustworthiness and out-of-distribution detection in controllable video generation.

Findings

01. Provides dense, pixel-level uncertainty heatmaps for generated videos.
02. Achieves calibrated uncertainty estimates within the training distribution.
03. Enables effective detection of out-of-distribution video frames.
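
To make "calibrated" concrete: a confidence estimate is calibrated when, among predictions made with confidence p, roughly a fraction p turn out to be correct. The sketch below computes the standard expected calibration error (ECE) over a flattened list of per-pixel confidences. It is an illustration only; the function name, the binning scheme, and the binary notion of a "correct" pixel are assumptions, and the excerpt here does not say which calibration metric the paper itself reports.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Bin-weighted mean gap between average confidence and accuracy.

    Hypothetical helper for illustration; not taken from the paper.
    `confidences` holds one value in [0, 1] per pixel (flattened), and
    `correct` marks whether the generated pixel was judged accurate.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")

    # Per-bin running totals: [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        mean_conf = conf_sum / count
        accuracy = n_correct / count
        ece += (count / total) * abs(mean_conf - accuracy)
    return ece


# Confidence 0.75 with 3 of 4 pixels correct: confidence matches accuracy.
well_calibrated = expected_calibration_error([0.75] * 4, [True, True, True, False])

# Confidence 1.0 with only half the pixels correct: overconfident.
overconfident = expected_calibration_error([1.0] * 4, [True, False, True, False])
```

A lower value means confidence tracks accuracy more closely. In this toy input the first case has no gap between confidence and accuracy, while the second reports full confidence for pixels that are right only half the time.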

Abstract

Recent advances in generative video models have led to significant breakthroughs in high-fidelity video synthesis, specifically in controllable video generation where the generated video is conditioned on text and action inputs, e.g., in instruction-guided video editing and world modeling in robotics. Despite these exceptional capabilities, controllable video models often hallucinate - generating future video frames that are misaligned with physical reality - which raises serious concerns in many tasks such as robot policy evaluation and planning. However, state-of-the-art video models lack the ability to assess and express their confidence, impeding hallucination mitigation. To rigorously address this challenge, we propose C3, an uncertainty quantification (UQ) method for training continuous-scale calibrated controllable video models for dense confidence estimation at the subpatch…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging