UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models

Tianxing Xu; Zixuan Wang; Guangyuan Wang; Li Hu; Zhongyi Zhang; Peng Zhang; Bang Zhang; Song-Hai Zhang

arXiv:2602.22960 · cs.CV · February 27, 2026

Open Access

TL;DR

UCM is a unified framework that combines long-term memory and camera control in world models. It uses time-aware positional encoding warping and an efficient dual-stream diffusion transformer to improve scene consistency and controllability in video generation.

Contribution

The paper proposes UCM, an approach that unifies long-term memory and camera control through time-aware positional encoding warping, supported by scalable data curation, to improve scene consistency and controllability.

Findings

01. Outperforms state-of-the-art methods in long-term scene consistency

02. Achieves precise camera control in high-fidelity video generation

03. Reduces computational overhead with an efficient dual-stream diffusion transformer

Abstract

World models based on video generation demonstrate remarkable potential for simulating interactive environments but face persistent difficulties in two key areas: maintaining long-term content consistency when scenes are revisited and enabling precise camera control from user-provided inputs. Existing methods based on explicit 3D reconstruction often compromise flexibility in unbounded scenarios and fine-grained structures. Alternative methods rely directly on previously generated frames without establishing explicit spatial correspondence, thereby constraining controllability and consistency. To address these limitations, we present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. To reduce computational overhead, we design an efficient dual-stream diffusion transformer for high-fidelity generation.…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Advanced Vision and Imaging