Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, Li Yuan

TL;DR
This paper investigates whether understanding in unified multimodal models actually informs generation. It finds a significant understanding-generation gap in reasoning generation and knowledge transfer, and shows that explicit Chain-of-Thought (CoT) reasoning helps bridge it.
Contribution
Introduces UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets, and demonstrates how explicit reasoning strategies can bridge the understanding-generation gap in unified multimodal models.
Findings
Explicit Chain-of-Thought improves reasoning generation.
Self-training internalizes reasoning abilities for implicit generation.
Query-based architectures exhibit latent Chain-of-Thought properties.
Abstract
Recent years have witnessed significant progress in Unified Multimodal Models, yet a fundamental question remains: Does understanding truly inform generation? To investigate this, we introduce UniSandbox, a decoupled evaluation framework paired with controlled, synthetic datasets to avoid data leakage and enable detailed analysis. Our findings reveal a significant understanding-generation gap, which is mainly reflected in two key dimensions: reasoning generation and knowledge transfer. Specifically, for reasoning generation tasks, we observe that explicit Chain-of-Thought (CoT) in the understanding module effectively bridges the gap, and further demonstrate that a self-training approach can successfully internalize this ability, enabling implicit reasoning during generation. Additionally, for knowledge transfer tasks, we find that CoT assists the generative process by helping retrieve…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · AI-based Problem Solving and Planning
