UEval: A Benchmark for Unified Multimodal Generation
Bo Li, Yida Yin, Wenhao Chai, Xingyu Fu, Zhuang Liu

TL;DR
UEval is a benchmark for unified multimodal models, i.e., models that generate both images and text. Outputs are scored against per-question rubrics rather than a single overall judge rating.
Contribution
This paper introduces UEval, a benchmark with expert-validated rubrics for scalable, fine-grained evaluation of unified multimodal models across diverse reasoning tasks.
Findings
Current models struggle with the benchmark, scoring below 70.
Reasoning capabilities improve model performance significantly.
Transferring reasoning traces enhances non-reasoning models' scores.
Abstract
We introduce UEval, a benchmark to evaluate unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. Our curated questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss the subtleties. Unlike previous work that relies on multimodal Large Language Models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system in UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria, and human experts then refine and validate these rubrics. In total, UEval contains 10,417…
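To make the rubric idea concrete, here is a minimal Python sketch of how a rubric with multiple criteria could be turned into a score. Everything in it is an assumption for illustration: the names `Criterion`, `score_answer`, and `benchmark_score` are hypothetical, the judge is a placeholder callable standing in for an MLLM, and the aggregation (fraction of criteria satisfied, scaled to 100, then averaged over questions) is one plausible choice that the text above does not confirm.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not the paper's schema)."""
    description: str
    modality: str  # "text" or "image": which part of the output it checks


def score_answer(rubric: List[Criterion],
                 judge: Callable[[Criterion], bool]) -> float:
    """Return the percentage of rubric criteria the judge marks as satisfied.

    `judge` stands in for an MLLM call that inspects the model output
    against a single criterion and answers yes/no.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    passed = sum(1 for criterion in rubric if judge(criterion))
    return 100.0 * passed / len(rubric)


def benchmark_score(per_question_scores: List[float]) -> float:
    """Average the per-question scores (assumed aggregation)."""
    if not per_question_scores:
        raise ValueError("need at least one question score")
    return sum(per_question_scores) / len(per_question_scores)


# Toy rubric for a single question; the criteria are invented examples.
example_rubric = [
    Criterion("Lists the steps in the correct order", "text"),
    Criterion("Explains why each step is needed", "text"),
    Criterion("Image shows the finished result", "image"),
    Criterion("Image labels match the text", "image"),
]

# Toy judge: pretend the text criteria pass and the image criteria fail.
toy_judge = lambda criterion: criterion.modality == "text"

question_score = score_answer(example_rubric, toy_judge)
overall = benchmark_score([question_score, 100.0])
```

Checking each criterion separately is what makes the result fine-grained: a model that writes a correct explanation but produces an unrelated image loses only the image criteria instead of receiving one opaque overall rating.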
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling
