Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

Weixing Wang; Liudvikas Zekas; Anton Hackl; Constantin Alexander Auga; Parisa Shahabinejad; Jona Otholt; Antonio Rueda-Toicen; Gerard de Melo

arXiv:2604.25072·cs.CV·April 29, 2026

TL;DR

This paper introduces XTC-Bench, a new evaluation framework for measuring semantic consistency across tasks in unified multimodal models, revealing that high task performance does not guarantee aligned representations.

Contribution

The paper proposes a scene-graph-grounded benchmarking method and a fine-grained metric to assess cross-task semantic alignment in uMMs, addressing a gap in current evaluation protocols.

Findings

01. High performance in generation or understanding does not ensure strong cross-task alignment.

02. Model architecture and the way learning objectives are coupled influence consistency more than unification itself does.

03. XTC-Bench is reproducible, model-agnostic, and helps diagnose representation misalignment.

Abstract

Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic…
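To make the idea of deriving both task inputs from one scene graph concrete, here is a minimal sketch. It is not the XTC-Bench implementation: the `SceneGraph` and `Fact` classes, the function names, and the prompt and question templates are all assumptions made for illustration. It only shows how a single structured graph could yield a generation prompt and a set of per-fact understanding queries that refer to the same objects, attributes, and relations. The CCTA metric itself is not sketched, since its definition is not given in the text above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One atomic fact (hypothetical structure, not from the paper)."""
    kind: str     # "object", "attribute", or "relation"
    parts: tuple  # ("dog",), ("dog", "brown"), or ("dog", "on", "sofa")


@dataclass
class SceneGraph:
    """Toy scene graph (hypothetical structure, not from the paper)."""
    objects: list
    attributes: list = field(default_factory=list)  # (object, attribute) pairs
    relations: list = field(default_factory=list)   # (subject, predicate, object)


def atomic_facts(graph):
    """Flatten the graph into object, attribute, and relation facts."""
    facts = [Fact("object", (obj,)) for obj in graph.objects]
    facts += [Fact("attribute", tuple(pair)) for pair in graph.attributes]
    facts += [Fact("relation", tuple(triple)) for triple in graph.relations]
    return facts


def generation_prompt(graph):
    """Render the whole graph as one text-to-image prompt (assumed template)."""
    attrs = {}
    for obj, attr in graph.attributes:
        attrs.setdefault(obj, []).append(attr)

    def describe(obj):
        # Attributes go before the noun; always uses "a", never "an".
        return " ".join(attrs.get(obj, []) + [obj])

    clauses, related = [], set()
    for subj, pred, obj in graph.relations:
        clauses.append(f"a {describe(subj)} {pred} a {describe(obj)}")
        related.update((subj, obj))
    # Objects that appear in no relation get their own clause.
    clauses += [f"a {describe(o)}" for o in graph.objects if o not in related]
    return "An image of " + " and ".join(clauses) + "."


def understanding_query(fact):
    """Turn one atomic fact into a yes/no question (assumed template)."""
    if fact.kind == "object":
        return f"Is there a {fact.parts[0]} in the image?"
    if fact.kind == "attribute":
        obj, attr = fact.parts
        return f"Is the {obj} {attr}?"
    subj, pred, obj = fact.parts
    return f"Is the {subj} {pred} the {obj}?"


graph = SceneGraph(
    objects=["dog", "sofa", "lamp"],
    attributes=[("dog", "brown")],
    relations=[("dog", "on", "sofa")],
)
prompt = generation_prompt(graph)
queries = [understanding_query(f) for f in atomic_facts(graph)]
```

Because the prompt and every query come from the same list of facts, the output of a generation run and the answers of an understanding run can be compared fact by fact rather than only through aggregate task scores.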
