Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs
Yujin Han, Hao Chen, Andi Han, Zhiheng Wang, Xinyu Liu, Yingya Zhang, Shiwei Zhang, Difan Zou

TL;DR
This paper identifies the internal gap between understanding and generation in MLLMs, and proposes a self-improvement framework that leverages understanding to enhance generation, leading to better unification and performance.
Contribution
It reveals the root cause of non-unification in MLLMs as weak generation and introduces a gap-based self-improvement method that improves both generation and understanding.
Findings
Self-improvement via understanding scoring enhances generation quality.
Generation and understanding co-improve through shared neural dynamics.
Curriculum learning further boosts unification and performance.
Abstract
Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such…
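The data-construction step described in the abstract (scoring the model's own generations with its understanding branch, then building SFT and DPO data) might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `build_post_training_data`, `generate`, `understand_score`, `num_candidates`, and `min_gap` are hypothetical names, and the toy stand-ins at the bottom replace real model calls so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # candidate the understanding branch scored highest
    rejected: str  # candidate the understanding branch scored lowest


def build_post_training_data(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],       # hypothetical: prompt, n -> n image handles
    understand_score: Callable[[str, str], float],   # hypothetical: prompt, image -> alignment score
    num_candidates: int = 4,
    min_gap: float = 0.0,
) -> Tuple[List[Tuple[str, str]], List[PreferencePair]]:
    """Build SFT examples and DPO pairs using only the model's own scores."""
    sft_data: List[Tuple[str, str]] = []
    dpo_data: List[PreferencePair] = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        if not candidates:
            continue
        # Score each candidate once with the (assumed stronger) understanding branch.
        scores = [understand_score(prompt, image) for image in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        worst = min(range(len(candidates)), key=scores.__getitem__)
        # SFT: keep the best-scored generation as the target.
        sft_data.append((prompt, candidates[best]))
        # DPO: keep a pair only when the two scores actually differ enough.
        if scores[best] - scores[worst] > min_gap:
            dpo_data.append(
                PreferencePair(prompt, candidates[best], candidates[worst])
            )
    return sft_data, dpo_data


# Toy stand-ins (assumptions for illustration only, not real model calls).
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt}|img{i}" for i in range(n)]


def toy_score(prompt: str, image: str) -> float:
    # Pretend later samples align better with the prompt.
    return float(image.rsplit("img", 1)[1])


sft, dpo = build_post_training_data(["a red cube"], toy_generate, toy_score, num_candidates=3)
```

In this sketch no external judge or reward model appears anywhere: both the candidates and the scores come from callables that stand for the same model, which is the property the abstract emphasizes.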
Peer Reviews
Decision·ICLR 2026 Poster
1. The self-improvement approach, which uses the understanding ability to guide the generation branch, is intuitive and intriguing in concept and effective in practice. 2. The work is generally solid and logically rigorous, with a motivation validated across multiple models, comprehensive experiments conducted on two models evaluated by both the authors’ proposed metrics and existing benchmarks, theoretical interpretation, and empirical validation. One of the relatively unconvincing and inelegan
The empirical evidence for some of the paper's claims is less conclusive than stated and needs further clarification: 1. In Fig.2(c), the claimed "trend of increasing with task difficulty" for the non-unification score is not obvious or monotonic. The variation between models seems to dominate any difficulty-based trend. 2. In Fig.7, the difference in similarity between improved samples and random samples is also not clear, especially for image pairs.
- Rigorous Problem Verification: The paper first confirms the internal gap in MLLMs where "generation is weaker than understanding" through large-scale evaluation (across 6 models and tasks of varying difficulty). To achieve this, the authors innovatively propose a "non-unification score" that does not rely on external evaluators. - Simple and Effective Solution: An "internal gap-based self-improvement" framework is proposed, which leverages the model's own stronger understanding capability to
- Experimental Results Heavily Rely on a Single, Insufficiently Strong Judge Model: Two of the paper's key conclusions, that the "internal gap stems mainly from weak generation" and the "co-improvement effect", rely heavily on using Qwen2.5-VL-72B-Instruct as the sole external judge. However, Qwen2.5-VL-72B is not a strong enough multimodal model to serve as a reliable evaluator, especially when dealing with complex or "hard tasks." To enhance the credibility of the experimental results, it is
1. The paper proposes an internal gap-based self-improvement framework, which is conceptually simple but novel, requiring no external supervision or reward models. The authors also introduce a new internal evaluation metric (Non-Unification Score) to quantify intra-model consistency. 2. The work provides strong empirical evidence through large-scale experiments on six unified MLLMs and three task difficulty levels (Figure 2). 3. Figures and algorithms are clearly presented, e.g., Algorithm 1
1. Theory–practice gap: While the shared eNTK explanation is conceptually interesting, its empirical validation remains limited, and the theoretical section is notation-heavy, reducing accessibility for non-theoretical readers. 2. Limited model diversity: Main experiments focus on Janus-Pro and Show-o. Although six models were initially analyzed, most in-depth post-training results come from only two, limiting the generality of conclusions.
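The reviews mention a "non-unification score" that quantifies intra-model consistency without external evaluators, but this page does not give its definition. One plausible reading, offered purely as an assumption, is the fraction of prompts for which the model's own understanding branch rejects the image its generation branch produced. The names `non_unification_score`, `generate_one`, and `self_judge` below are hypothetical, and the toy callables stand in for real model calls.

```python
from typing import Callable, Sequence


def non_unification_score(
    prompts: Sequence[str],
    generate_one: Callable[[str], str],      # hypothetical: prompt -> image handle
    self_judge: Callable[[str, str], bool],  # hypothetical: does the image match the prompt?
) -> float:
    """Fraction of prompts where the model's understanding rejects its own generation.

    Assumed definition for illustration; the paper's exact formula may differ.
    """
    if not prompts:
        raise ValueError("need at least one prompt")
    rejected = sum(
        0 if self_judge(prompt, generate_one(prompt)) else 1 for prompt in prompts
    )
    return rejected / len(prompts)


# Toy stand-ins: generation succeeds only on "easy" prompts.
def toy_generate_one(prompt: str) -> str:
    return prompt if "easy" in prompt else "unrelated image"


def toy_self_judge(prompt: str, image: str) -> bool:
    return image == prompt


toy_prompts = ["easy cat", "hard scene", "easy dog", "hard count"]
score = non_unification_score(toy_prompts, toy_generate_one, toy_self_judge)
```

Under this reading, a score of 0 would mean the two branches always agree, and higher values would indicate a larger internal gap.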
Taxonomy
Topics: Knowledge Management and Sharing
