Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Han Lin; Xichen Pan; Ziqi Huang; Ji Hou; Jialiang Wang; Weifeng Chen; Zecheng He; Felix Juefei-Xu; Junzhe Sun; Zhipeng Fan; Ali Thabet; Mohit Bansal; Chu Wang

arXiv:2512.11464 · cs.CV · December 15, 2025

Open Access

TL;DR

MetaCanvas lets multimodal large language models reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion models, improving control and precision in visual generation tasks.

Contribution

The paper introduces MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces, giving finer control over visual generation and narrowing the gap between multimodal understanding and generation.

Findings

01. MetaCanvas outperforms global-conditioning baselines across six tasks.

02. Empirical evaluation on three diffusion backbones demonstrates versatility.

03. Treating MLLMs as latent-space planners enhances generation precision.

Abstract

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning