The Trinity of Consistency as a Defining Principle for General World Models

Jingxuan Wei; Siyuan Li; Yuhang Xu; Zheng Sun; Junjie Jiang; Hexuan Jin; Caijun Jia; Honghao He; Xinglong Xu; Xi Bai; Chang Yu; Yumou Liu; Junnan Zhu; Xuanhe Zhou; Jintao Chen; Xiaobin Hu; Shancheng Pang; Bihui Yu; Ran He; Zhen Lei; Stan Z. Li; Conghui He; Shuicheng Yan; Cheng Tan

arXiv:2602.23152 · cs.AI · February 27, 2026

TL;DR

This paper proposes the Trinity of Consistency as a foundational principle for General World Models: modal consistency (the semantic interface), spatial consistency (the geometric basis), and temporal consistency (the causal engine). It also introduces CoW-Bench, a benchmark for evaluating multimodal models on reasoning and generation tasks.

Contribution

The paper introduces a theoretical framework built on three types of consistency for world models and presents CoW-Bench, a new benchmark for evaluating multi-frame reasoning and generation.

Findings

01 Unified architectures enable emergent internal world simulators

02 CoW-Bench effectively evaluates multimodal models' reasoning and generation

03 The framework clarifies limitations and guides future model development

Abstract

The construction of World Models capable of learning, simulating, and reasoning about objective physical laws constitutes a foundational challenge in the pursuit of Artificial General Intelligence. Recent advancements represented by video generation models like Sora have demonstrated the potential of data-driven scaling laws to approximate physical dynamics, while the emerging Unified Multimodal Model (UMM) offers a promising architectural paradigm for integrating perception, language, and reasoning. Despite these advances, the field still lacks a principled theoretical framework that defines the essential properties requisite for a General World Model. In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine. Through…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation