Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities?

Hongbo Jiang; Jie Li; Yunhang Shen; Pingyang Dai; Xing Sun; Haoyu Cao; Liujuan Cao

arXiv:2602.23711 · cs.CV · March 2, 2026

TL;DR

This paper investigates whether unified multimodal large language models can maintain semantic consistency across text and image outputs, revealing significant challenges in cross-modal semantic alignment despite strong individual modality performance.

Contribution

The study introduces VGUBench, a diagnostic framework to evaluate reasoning and rendering across modalities, exposing a gap in semantic alignment in current U-MLLMs.
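One way to picture the semantic alignment gap is as a per-question agreement rate between a model's textual answer and the answer recovered from its image output. The sketch below is illustrative only and is not VGUBench's actual metric: the function names, the normalization rule, and the example answers are all hypothetical, and it assumes the answer shown in the generated image has already been transcribed to text by some external step.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().split())


def equivalence_rate(text_answers: list[str], visual_answers: list[str]) -> float:
    """Fraction of questions where the text answer matches the answer read from the image.

    Hypothetical helper: exact match after normalization is an assumed criterion,
    not the paper's scoring rule.
    """
    if len(text_answers) != len(visual_answers):
        raise ValueError("Both answer lists must cover the same questions.")
    if not text_answers:
        return 0.0
    matches = sum(
        normalize(t) == normalize(v) for t, v in zip(text_answers, visual_answers)
    )
    return matches / len(text_answers)


# Made-up answers for three questions (not data from the paper).
text_answers = ["Paris", "42", "blue whale"]
visual_answers = ["paris", "24", "Blue  Whale"]  # transcribed from generated images

rate = equivalence_rate(text_answers, visual_answers)
print(f"semantic equivalence rate: {rate:.2f}")
```

Exact matching is the simplest possible criterion; a real evaluation would likely need a more tolerant comparison, such as a judge model or task-specific answer parsing.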

Findings

01. Models perform well in textual reasoning and visual rendering.

02. Models fail to maintain semantic equivalence in visual responses.

03. Performance in visual answering is weakly correlated with rendering quality.
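To make the "weakly correlated" finding concrete, the following minimal sketch shows how a Pearson correlation between per-model rendering quality and visual answering accuracy could be computed. The scores are synthetic placeholders invented for illustration, not results from the paper, and the variable names are assumptions; the paper may use a different correlation measure.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("Need two lists of equal length with at least two items.")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("Correlation is undefined when one list is constant.")
    return cov / math.sqrt(var_x * var_y)


# Synthetic placeholder scores, one entry per hypothetical model.
rendering_quality = [0.90, 0.85, 0.80, 0.95, 0.88]
visual_answer_accuracy = [0.30, 0.42, 0.25, 0.33, 0.40]

r = pearson(rendering_quality, visual_answer_accuracy)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 0 would mean that rendering quality tells us little about whether the rendered answer is correct, which is the pattern this finding describes.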

Abstract

Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture. However, existing evaluations typically assess these capabilities separately, overlooking semantic equivalence, i.e., the ability to manifest consistent reasoning results regardless of the output modality. In this work, we investigate whether current U-MLLMs satisfy this premise. We observe that while models demonstrate robust textual reasoning, they fail to maintain semantic equivalence when required to render the same results in the image modality. To rigorously diagnose this discrepancy, we introduce VGUBench, a framework to decouple reasoning logic from generation fidelity. VGUBench comprises three diagnostic tasks: (1) Textual Generative Understanding, establishing a baseline for reasoning accuracy in textual response; (2) Visual Generative Understanding, evaluating…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling