Quantifying the Gap between Understanding and Generation within Unified Multimodal Models

Chenlong Wang; Yuhang Chen; Zhihan Hu; Dongping Chen; Wenhu Chen; Sarah Wiegreffe; Tianyi Zhou

arXiv:2602.02140 · cs.CL · February 3, 2026

TL;DR

This paper introduces GapEval, a benchmark to measure the alignment between understanding and generation in unified multimodal models, revealing a persistent gap and disjointed knowledge across modalities.

Contribution

The paper presents GapEval, a novel bidirectional benchmark for quantifying the understanding-generation gap in multimodal models, and provides empirical insights into their limitations.

Findings

01. Significant gap between understanding and generation capabilities in UMMs

02. Knowledge within UMMs remains disjoint across modalities

03. Emergence of capabilities and knowledge is unsynchronized

Abstract

Recent advances in unified multimodal models (UMMs) have demonstrated remarkable progress in both understanding and generation tasks. However, whether these two capabilities are genuinely aligned and integrated within a single model remains unclear. To investigate this question, we introduce GapEval, a bidirectional benchmark designed to quantify the gap between understanding and generation capabilities and to measure the cognitive coherence of the two "unified" directions. Each question can be answered in both modalities (image and text), enabling a symmetric evaluation of a model's bidirectional inference capability and cross-modal consistency. Experiments reveal a persistent gap between the two directions across a wide range of UMMs with different architectures, suggesting that current models achieve only surface-level unification rather than deep cognitive convergence of…
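
The abstract does not spell out how the gap or the cross-modal consistency is scored, so the following is only a minimal sketch of how a symmetric evaluation of this kind could be summarized. It assumes each question has already been judged correct or incorrect twice, once when answered as text and once when answered as an image. The names `ItemResult` and `summarize`, and the specific metrics (accuracy difference and per-question agreement), are hypothetical illustrations and are not taken from GapEval.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ItemResult:
    """One benchmark question judged in both directions (hypothetical schema)."""
    question_id: str
    text_correct: bool   # answer produced as text was judged correct
    image_correct: bool  # answer produced as an image was judged correct


def summarize(results: List[ItemResult]) -> Dict[str, float]:
    """Aggregate paired judgments into per-direction accuracy, gap, and agreement.

    Assumption: 'gap' is the plain accuracy difference and 'consistency' is the
    share of questions where both directions agree (both right or both wrong).
    The paper may define these quantities differently.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    understanding = sum(r.text_correct for r in results) / n
    generation = sum(r.image_correct for r in results) / n
    consistency = sum(r.text_correct == r.image_correct for r in results) / n
    both = sum(r.text_correct and r.image_correct for r in results) / n
    return {
        "understanding_acc": understanding,
        "generation_acc": generation,
        "gap": understanding - generation,
        "consistency": consistency,
        "both_correct": both,
    }


if __name__ == "__main__":
    # Toy judgments, invented purely for illustration.
    demo = [
        ItemResult("q1", True, True),
        ItemResult("q2", True, False),
        ItemResult("q3", True, False),
        ItemResult("q4", False, False),
    ]
    print(summarize(demo))
```

Reporting agreement per question alongside the accuracy difference is a deliberate choice in this sketch: two directions can have similar accuracy while succeeding on different questions, which is one way knowledge could be disjoint across modalities.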

Taxonomy

Topics: Multimodal Machine Learning Applications · Child and Animal Learning Development · Face Recognition and Perception