ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang

TL;DR
ROVER is a benchmark designed to evaluate reciprocal cross-modal reasoning in unified multimodal models, addressing the gap in current assessments by testing how models use one modality to guide or verify outputs in the other.
Contribution
It introduces a human-annotated benchmark with 1312 tasks for reciprocal cross-modal reasoning, spanning visual and verbal modalities, to evaluate and improve unified multimodal models.
Findings
Interleaved models outperform non-interleaved ones in cross-modal reasoning.
Cross-modal reasoning quality impacts visual generation performance.
Models excel at literal perceptual concepts but struggle with visual abstractions for symbolic reasoning.
Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper is well structured and presented overall, with a helpful project page.
- It addresses an important gap in UMMs by benchmarking and evaluating reciprocal cross-modal reasoning.
- Table 1 should include comparisons across more aspects. Additional explanations are needed in both the text and the table caption: benchmark dataset scale, whether it is for VG/TG/both, and clarifications on the multi-dimensional and hybrid evaluations and the types.
- This work's emphasis on intermediate reasoning as a core signal for multimodal reasoning distinguishes it from existing benchmarks. However, the data curation process for these progressive reasoning steps is under-specified, esp
1. The importance of cross-modal reasoning in UMMs and the problem formulation in two complementary settings, verbally-augmented reasoning for visual generation (ROVER-IG) and visually-augmented reasoning for verbal generation (ROVER-TG), are interesting, useful, and novel.
2. Careful dataset design into top-level domains and subtasks for both ROVER-IG and ROVER-TG.
3. Detailed metrics that aim to provide a holistic understanding of model performance in either setting.
4. Interesting analysis like co
The paper makes a good attempt at covering a novel perspective but falls short in the following areas:
1. Stretched or overstated claims:
a) In Section 4.1 (p. 5, last paragraph), the authors claim that gaps in the reasoning process and alignment are the fundamental driver of diminished visual generation performance. However, in Table 2, for natural science or logic for instance, both closed- and open-source models with similar RP and Align scores show great variability in RV scores.
b) "Pg 7 section 4.2 Mo
1. This paper is generally well-written and easy to follow, with clearly illustrated figures.
2. ROVER covers a wide range of both language-reasoning tasks and visual-reasoning tasks, and uses a comprehensive evaluation method (VLM + expert validation) to ensure reliability.
3. The authors evaluate 17 unified multimodal models and provide insightful findings.
1. The benchmark heavily depends on a "VLM-as-a-judge" for scoring complex reasoning qualities. The paper's own user study (Figure 8) shows that while correlation is good, there are noticeable discrepancies, especially for reasoning-related metrics. This introduces a potential bias, where the benchmark might favor models whose outputs align with the judging VLM's own reasoning patterns.
2. As listed in Table 3, language-only models often match or exceed the performance of unified models on reaso
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Language, Metaphor, and Cognition
