LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model

Jiachun Jin; Zetong Zhou; Xiao Yang; Hao Zhang; Pengfei Liu; Jun Zhu; Zhijie Deng

arXiv:2604.02097 · cs.CV · April 3, 2026

TL;DR

LatentUM introduces a shared semantic latent space for unified cross-modal reasoning and generation, improving efficiency, alignment, and performance on visual tasks.

Contribution

The paper proposes a latent-space unified model that eliminates pixel-space mediation between visual understanding and generation, enabling more effective and efficient interleaved cross-modal reasoning and generation.

Findings

1. Achieves state-of-the-art results on the Visual Spatial Planning benchmark.
2. Enhances visual generation via self-reflection.
3. Supports world modeling by predicting future visual states.

Abstract

Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally…

Code & Models

Repositories

sjtu-deng-lab/LatentUM (GitHub)
