Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models
Weixing Wang, Liudvikas Zekas, Anton Hackl, Constantin Alexander Auga, Parisa Shahabinejad, Jona Otholt, Antonio Rueda-Toicen, Gerard de Melo

TL;DR
This paper introduces XTC-Bench, a new evaluation framework for measuring semantic consistency across tasks in unified multimodal models, revealing that high task performance does not guarantee aligned representations.
Contribution
The paper proposes a scene-graph-grounded benchmarking method and a fine-grained metric to assess cross-task semantic alignment in uMMs, addressing a gap in current evaluation protocols.
Findings
High performance in generation or understanding does not ensure strong cross-task alignment.
Model architecture and the way learning objectives are coupled influence consistency more than unification itself does.
XTC-Bench is reproducible, model-agnostic, and helps diagnose representation misalignment.
Abstract
Unified Multimodal Models (uMMs) aim to support both visual understanding and visual generation within a shared representation. However, existing evaluation protocols assess these two capabilities independently and do not examine whether they are semantically aligned. As a result, it remains unclear whether current uMMs learn coherent unified representations that remain consistent across tasks given a visual concept. We introduce XTC-Bench, a scene-graph-grounded evaluation framework that measures cross-task visual semantic consistency. By deriving both generation prompts and understanding queries from a structured scene graph, our framework enables fact-level alignment analysis across objects, attributes, and relations. We propose Continuous Cross-Task Agreement (CCTA), a fine-grained metric that quantifies semantic agreement between generation and understanding over matched atomic…
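As a rough illustration of the idea in the abstract, the sketch below derives a generation prompt and a set of understanding queries from one small scene graph, then compares per-fact scores from the two tasks. Everything here is an assumption made for illustration: the graph layout, the `Fact` type, the question templates, and in particular `toy_agreement`, which is a simple stand-in (one minus the absolute score difference, averaged over facts) and not the paper's definition of CCTA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One atomic fact (hypothetical structure, not from the paper)."""
    kind: str         # "object", "attribute", or "relation"
    subject: str
    value: str = ""   # attribute value or relation predicate
    target: str = ""  # object of a relation


def scene_graph_to_facts(graph):
    """Flatten a scene graph into object, attribute, and relation facts."""
    facts = []
    for name, attrs in graph["objects"].items():
        facts.append(Fact("object", name))
        for attr in attrs:
            facts.append(Fact("attribute", name, attr))
    for subj, pred, obj in graph["relations"]:
        facts.append(Fact("relation", subj, pred, obj))
    return facts


def generation_prompt(graph):
    """Build a text-to-image prompt from the same graph."""
    objs = [" ".join([*attrs, name]) for name, attrs in graph["objects"].items()]
    rels = [f"the {s} is {p} the {o}" for s, p, o in graph["relations"]]
    return "An image with " + " and ".join(f"a {o}" for o in objs) + "; " + ", ".join(rels) + "."


def understanding_query(fact):
    """Build a yes/no question that probes a single fact."""
    if fact.kind == "object":
        return f"Is there a {fact.subject} in the image?"
    if fact.kind == "attribute":
        return f"Is the {fact.subject} {fact.value}?"
    return f"Is the {fact.subject} {fact.value} the {fact.target}?"


def toy_agreement(gen_scores, und_scores):
    """Illustrative stand-in metric: mean of 1 - |g - u| over matched facts.

    gen_scores / und_scores map each Fact to a score in [0, 1] saying how
    well the generation or understanding output supports that fact.
    """
    facts = gen_scores.keys() & und_scores.keys()
    return sum(1 - abs(gen_scores[f] - und_scores[f]) for f in facts) / len(facts)


graph = {
    "objects": {"cat": ["black"], "sofa": ["red"]},
    "relations": [("cat", "on", "sofa")],
}
facts = scene_graph_to_facts(graph)
prompt = generation_prompt(graph)
queries = [understanding_query(f) for f in facts]
```

Because both the prompt and the queries come from the same fact list, every score on the generation side has a matched counterpart on the understanding side, which is what makes a per-fact comparison possible.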
