Divide and Conquer: Decoupled Representation Alignment for Multimodal World Models

Junyuan Xiao; Dingkang Liang; Xin Zhou; Yixuan Ye; Tongtong Su; Guangmo Yi; Bin Xia; Qiang Lyu; Shurui Shi; Jun Huang; Jianlou Si; Wenming Yang

arXiv:2605.01896·cs.CV·May 5, 2026

TL;DR

This paper introduces $M^{2}$-REPA, a representation alignment method for multi-modal video generation. It aligns modality-specific features with expert foundation models to exploit their priors, improving visual quality and long-term consistency.

Contribution

The paper presents the first representation alignment approach for multi-modal video generation. It decouples modality-specific features from the diffusion model's intermediate representations and aligns each with its corresponding expert foundation model.

Findings

01 Outperforms baselines in visual quality

02 Achieves better long-term consistency

03 Effectively leverages foundation model priors

Abstract

Emerging multi-modal world models attempt to jointly generate videos across diverse modalities (e.g., RGB, depth, and mask), yet they fail to fully exploit the rich priors of existing foundation models. We propose $M^{2}$-REPA, the first representation alignment method tailored for multi-modal video generation. Our key insight is that foundation models trained on different modality spaces naturally capture distinct domain-specific priors, acting as complementary "experts." Specifically, we first decouple modality-specific features from the diffusion model's intermediate representations, then align each with its corresponding expert foundation model. To this end, we design two synergistic objectives: a multi-modal representation alignment loss that enforces feature-to-expert matching, and a modality-specific decoupling regularization that encourages complementarity across different…
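The abstract does not give the exact form of the two objectives, so the following is only a rough sketch of how such a pair of losses could be combined. It assumes cosine similarity as the matching measure, a squared-cosine penalty between modalities as the decoupling term, and a scalar `weight`; all function names, the feature dictionaries, and these choices are hypothetical illustrations, not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_loss(decoupled, expert):
    """Assumed alignment term: mean (1 - cosine) between each modality's
    decoupled feature and the feature from its expert foundation model."""
    return sum(1.0 - cosine(decoupled[m], expert[m]) for m in decoupled) / len(decoupled)


def decoupling_penalty(decoupled):
    """Assumed decoupling term: mean squared cosine similarity over all
    pairs of distinct modalities, so overlapping features are penalized."""
    names = sorted(decoupled)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    if not pairs:
        return 0.0
    return sum(cosine(decoupled[a], decoupled[b]) ** 2 for a, b in pairs) / len(pairs)


def total_loss(decoupled, expert, weight=0.5):
    """Hypothetical combined objective: alignment plus weighted decoupling."""
    return alignment_loss(decoupled, expert) + weight * decoupling_penalty(decoupled)


# Toy features for three modalities (hypothetical values).
orthogonal = {"rgb": [1.0, 0.0, 0.0], "depth": [0.0, 1.0, 0.0], "mask": [0.0, 0.0, 1.0]}
collapsed = {"rgb": [1.0, 0.0, 0.0], "depth": [1.0, 0.0, 0.0], "mask": [1.0, 0.0, 0.0]}

loss_orthogonal = total_loss(orthogonal, orthogonal)
loss_collapsed = total_loss(collapsed, collapsed)
```

In this toy setup, features that match their experts and are mutually orthogonal incur no loss, while features that match their experts but collapse onto one direction are still penalized by the decoupling term. In a real model the features would be per-token tensors passed through learned projection heads, which this sketch omits.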
