UniG2U-Bench: Do Unified Models Advance Multimodal Understanding?

Zimo Wen; Boxiu Li; Wanbo Zhang; Junxiang Lei; Xiaoyu Chen; Yijia Fan; Qi Zhang; Yujiang Wang; Lili Qiu; Bo Li; Ziwei Liu; Caihua Shan; Yifan Yang; Yifei Shen

arXiv:2603.03241 · cs.CV · March 4, 2026

Open Access

TL;DR

This paper introduces UniG2U-Bench, a benchmark for testing whether the generation capabilities of unified multimodal models improve their understanding. It finds that unified models generally underperform their base Vision-Language Models (VLMs), while spatial intelligence, visual illusion, and multi-round reasoning subtasks show consistent gains.

Contribution

The paper presents UniG2U-Bench, a new benchmark categorizing generation-to-understanding evaluation into 7 regimes and 30 subtasks, and provides extensive analysis of over 30 models.

Findings

01. Unified models generally underperform their base VLMs.

02. Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference.

03. Spatial intelligence, visual illusion, and multi-round reasoning subtasks show consistent improvements.
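The first two findings rest on accuracy comparisons between two conditions (unified model versus base VLM, or GtA versus direct inference). A minimal sketch of how such a per-subtask comparison could be computed is shown here. The function names, the record format, and the toy data are all hypothetical illustrations and are not taken from the paper or its evaluation code.

```python
from collections import defaultdict


def accuracy_by_subtask(records):
    """Map each subtask to its accuracy, given (subtask, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subtask, is_correct in records:
        totals[subtask] += 1
        hits[subtask] += int(is_correct)
    return {s: hits[s] / totals[s] for s in totals}


def delta_by_subtask(baseline, candidate):
    """Accuracy change (candidate minus baseline) on subtasks present in both."""
    return {s: candidate[s] - baseline[s] for s in baseline if s in candidate}


# Made-up toy results; the subtask names are placeholders, not benchmark subtasks.
direct = [("subtask_a", True), ("subtask_a", False),
          ("subtask_b", True), ("subtask_b", True)]
gta = [("subtask_a", True), ("subtask_a", True),
       ("subtask_b", True), ("subtask_b", False)]

deltas = delta_by_subtask(accuracy_by_subtask(direct), accuracy_by_subtask(gta))
for subtask, delta in sorted(deltas.items()):
    print(f"{subtask}: {delta:+.2f}")
```

A positive delta means the candidate condition (here, the toy GtA results) scored higher on that subtask, and a negative delta means it scored lower. The same helper applies to a unified-model versus base-VLM comparison by swapping the inputs.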

Abstract

Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. Existing benchmarks lack a systematic exploration of the specific tasks where generation facilitates understanding. To this end, we introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks, requiring varying degrees of implicit or explicit visual transformations. Extensive evaluation of over 30 models reveals three core findings: 1) Unified models generally underperform their base Vision-Language Models (VLMs), and Generate-then-Answer (GtA) inference typically degrades performance relative to direct inference. 2) Consistent enhancements emerge in spatial intelligence, visual illusions, or multi-round reasoning subtasks, where enhanced spatial and…

Taxonomy

Topics: Multimodal Machine Learning Applications · Child and Animal Learning Development · Domain Adaptation and Few-Shot Learning