From Reasoning to Pixels: Benchmarking the Alignment Gap in Unified Multimodal Models
Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick

TL;DR
This paper introduces UReason, a benchmark that uses reasoning-guided image generation to evaluate cross-modal alignment in unified multimodal models (UMMs). It finds that current models struggle to align the visual semantics of generated images with their textual reasoning.
Contribution
The paper presents UReason, a new benchmark and evaluation framework that assesses cross-modal alignment in unified multimodal models (UMMs) through reasoning-guided image generation tasks.
Findings
Reasoning-guided generation improves over direct generation.
De-contextualized generation outperforms reasoning-guided generation.
Current UMMs do not reliably reflect visual semantics in generated images.
Abstract
Unified multimodal models (UMMs) aim to integrate multimodal understanding and generation within a unified architecture, yet it remains unclear to what extent their representations are truly aligned across modalities. To investigate this question, we use reasoning-guided image generation as a diagnostic task, where models produce textual reasoning first and then generate images. We introduce UReason, a benchmark for evaluating cross-modal alignment in this paradigm, consisting of 2,000 manually curated instances spanning five reasoning-intensive tasks: Code, Arithmetic, Spatial, Attribute and Text. To enable controlled analysis, we develop an evaluation framework that compares direct generation, reasoning-guided generation and de-contextualized generation, which conditions only on the refined prompt extracted from reasoning. Across eight widely used UMMs, while we find that…
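The three evaluation settings named in the abstract (direct, reasoning-guided, and de-contextualized generation) differ only in what the image generator is conditioned on. The following is a minimal sketch of that difference, not the authors' implementation. The function names `build_conditions`, `toy_reason`, and `toy_extract`, the "Refined prompt:" marker, and the example prompt are all hypothetical, and a real UMM would replace the toy callables.

```python
from typing import Callable, Dict


def build_conditions(
    prompt: str,
    reason: Callable[[str], str],
    extract_refined_prompt: Callable[[str], str],
) -> Dict[str, str]:
    """Return the text each setting would condition image generation on.

    Hypothetical helper: the paper's actual prompt formats are not given here.
    """
    reasoning = reason(prompt)                    # textual reasoning produced first
    refined = extract_refined_prompt(reasoning)   # refined prompt pulled from the reasoning
    return {
        "direct": prompt,                              # original prompt only
        "reasoning_guided": prompt + "\n" + reasoning, # prompt plus full reasoning
        "de_contextualized": refined,                  # refined prompt only
    }


def toy_reason(prompt: str) -> str:
    # Stand-in for a model's reasoning output (assumed format).
    return "Step: 2 + 3 = 5.\nRefined prompt: five red apples on a table"


def toy_extract(reasoning: str) -> str:
    # Assumes the reasoning ends with a line starting with "Refined prompt:".
    marker = "Refined prompt:"
    idx = reasoning.rfind(marker)
    if idx == -1:
        return reasoning.strip()
    return reasoning[idx + len(marker):].strip()


conditions = build_conditions("Draw 2 + 3 apples", toy_reason, toy_extract)
for name, text in conditions.items():
    print(f"[{name}]\n{text}\n")
```

Under this framing, comparing image quality between the "reasoning_guided" and "de_contextualized" inputs isolates whether the surrounding reasoning context helps or hinders generation once the refined prompt is fixed.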
