Turning Internal Gap into Self-Improvement: Promoting the Generation-Understanding Unification in MLLMs

Yujin Han; Hao Chen; Andi Han; Zhiheng Wang; Xinyu Liu; Yingya Zhang; Shiwei Zhang; Difan Zou

arXiv:2507.16663·cs.CL·September 26, 2025

TL;DR

This paper identifies the internal gap between understanding and generation in MLLMs, and proposes a self-improvement framework that leverages understanding to enhance generation, leading to better unification and performance.

Contribution

It identifies weak generation, rather than misunderstanding, as the root cause of non-unification in MLLMs, and introduces a gap-based self-improvement method that improves both generation and understanding.

Findings

01. Self-improvement via understanding scoring enhances generation quality.

02. Generation and understanding co-improve through shared neural dynamics.

03. Curriculum learning further boosts unification and performance.

Abstract

Although unified MLLMs aim to unify generation and understanding, they are considered to exhibit an internal gap, with understanding outperforming generation. Through large-scale evaluation across multiple MLLMs and tasks, we confirm the widespread non-unification of MLLMs, and demonstrate that it indeed stems from weak generation rather than misunderstanding. This finding motivates us to propose a simple yet effective internal gap-based self-improvement framework, which mitigates internal gaps by leveraging stronger understanding to guide weaker generation without relying on any external signals. We validate this strategy through comprehensive experiments: scoring generations with understanding to construct image data for post-training (e.g., SFT and DPO) significantly improves generation while promoting unification. Furthermore, we empirically discover a co-improvement effect of such…
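The data-construction step described in the abstract (scoring the model's own generations with its understanding ability, then using the results for SFT and DPO) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` and `score` are hypothetical callables standing in for the MLLM's generation branch and its understanding-based scoring, and the best-of-N / best-versus-worst selection rule is an assumption made for the example.

```python
from typing import Any, Callable, Dict, List, Tuple


def build_self_improvement_data(
    prompts: List[str],
    generate: Callable[[str], Any],      # hypothetical: prompt -> generated image
    score: Callable[[str, Any], float],  # hypothetical: understanding-based alignment score
    n_candidates: int = 4,
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Build SFT examples and DPO preference pairs from self-scored generations."""
    sft_data: List[Dict[str, Any]] = []
    dpo_data: List[Dict[str, Any]] = []
    for prompt in prompts:
        # Sample several candidates from the (weaker) generation branch.
        candidates = [generate(prompt) for _ in range(n_candidates)]
        # Score each candidate once with the (stronger) understanding branch.
        scored = [(score(prompt, image), image) for image in candidates]
        best_score, best = max(scored, key=lambda pair: pair[0])
        worst_score, worst = min(scored, key=lambda pair: pair[0])
        # Assumed SFT rule: keep the highest-scoring candidate as the target.
        sft_data.append({"prompt": prompt, "image": best})
        # Assumed DPO rule: pair best against worst only when the scores differ.
        if best_score > worst_score:
            dpo_data.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return sft_data, dpo_data
```

No external judge or reward model appears in the loop, which matches the abstract's claim that the framework does not rely on external signals; how candidates are actually sampled, scored, and filtered in the paper may differ from this sketch.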

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 8·Confidence 3

Strengths

1. The self-improvement approach, which uses the understanding ability to guide the generation branch, is intuitive and intriguing in concept and effective in practice.
2. The work is generally solid and logically rigorous, with a motivation validated across multiple models, comprehensive experiments conducted on two models evaluated by both the authors’ proposed metrics and existing benchmarks, theoretical interpretation, and empirical validation. One of the relatively unconvincing and inelegan

Weaknesses

The empirical evidence for some of the paper's claims is less conclusive than stated and needs further clarification:
1. In Fig. 2(c), the claimed "trend of increasing with task difficulty" for the non-unification score is neither obvious nor monotonic. The variation between models seems to dominate any difficulty-based trend.
2. In Fig. 7, the difference in similarity between improved samples and random samples is also unclear, especially for image pairs.

Reviewer 02·Rating 6·Confidence 4

Strengths

- Rigorous Problem Verification: The paper first confirms the internal gap in MLLMs where "generation is weaker than understanding" through large-scale evaluation (across 6 models and tasks of varying difficulty). To achieve this, the authors innovatively propose a "non-unification score" that does not rely on external evaluators.
- Simple and Effective Solution: An "internal gap-based self-improvement" framework is proposed, which leverages the model's own stronger understanding capability to

Weaknesses

- Experimental Results Heavily Rely on a Single, Insufficiently Strong Judge Model: Two of the paper's key conclusions (that the "internal gap stems mainly from weak generation" and the "co-improvement effect") rely heavily on using Qwen2.5-VL-72B-Instruct as the sole external judge. However, Qwen2.5-VL-72B is not a strong enough multimodal model to serve as a reliable evaluator, especially when dealing with complex or "hard tasks." To enhance the credibility of the experimental results, it is

Reviewer 03·Rating 8·Confidence 3

Strengths

1. The paper proposes an internal gap-based self-improvement framework, which is conceptually simple but novel, requiring no external supervision or reward models. The authors also introduce a new internal evaluation metric (Non-Unification Score) to quantify intra-model consistency.
2. The work provides strong empirical evidence through large-scale experiments on six unified MLLMs and three task difficulty levels (Figure 2).
3. Figures and algorithms are clearly presented, e.g., Algorithm 1

Weaknesses

1. Theory–practice gap: While the shared eNTK explanation is conceptually interesting, its empirical validation remains limited, and the theoretical section is notation-heavy, reducing accessibility for non-theoretical readers.
2. Limited model diversity: Main experiments focus on Janus-Pro and Show-o. Although six models were initially analyzed, most in-depth post-training results come from only two, limiting the generality of conclusions.

Taxonomy

Topics: Knowledge Management and Sharing