UEval: A Benchmark for Unified Multimodal Generation

Bo Li; Yida Yin; Wenhao Chai; Xingyu Fu; Zhuang Liu

arXiv:2601.22155 · cs.CV · January 30, 2026

TL;DR

UEval is a comprehensive benchmark designed to evaluate the performance of unified multimodal models capable of generating both images and text, using a detailed rubric-based scoring system for nuanced assessment.

Contribution

This paper introduces UEval, a novel benchmark with expert-validated rubrics for scalable, fine-grained evaluation of multimodal models across diverse reasoning tasks.

Findings

01. Current models struggle with the benchmark, scoring below 70.

02. Reasoning capabilities improve model performance significantly.

03. Transferring reasoning traces enhances non-reasoning models' scores.

Abstract

We introduce UEval, a benchmark for evaluating unified models, i.e., models capable of generating both images and text. UEval comprises 1,000 expert-curated questions that require both images and text in the model output, sourced from 8 real-world tasks. The questions cover a wide range of reasoning types, from step-by-step guides to textbook explanations. Evaluating open-ended multimodal generation is non-trivial, as simple LLM-as-a-judge methods can miss subtleties. Unlike previous work that relies on multimodal large language models (MLLMs) to rate image quality or text accuracy, we design a rubric-based scoring system for UEval. For each question, reference images and text answers are provided to an MLLM to generate an initial rubric consisting of multiple evaluation criteria; human experts then refine and validate these rubrics. In total, UEval contains 10,417…
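To make the rubric idea concrete, here is a minimal sketch of how per-criterion judgments could be turned into a score. It is not the authors' implementation: the abstract does not state how criteria are weighted or aggregated, so the equal weighting, the 0-100 scale, and every name below (`Criterion`, `score_response`, `benchmark_score`, the sample rubric, and the stand-in judge) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric item; the text is what a judge checks the output against."""
    text: str


def score_response(
    rubric: List[Criterion],
    judge: Callable[[Criterion], bool],
) -> float:
    """Return the percentage (0-100) of rubric criteria the judge marks as met.

    Assumption: every criterion carries equal weight.
    """
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    met = sum(1 for criterion in rubric if judge(criterion))
    return 100.0 * met / len(rubric)


def benchmark_score(per_question_scores: List[float]) -> float:
    """Average per-question scores into one number (assumed aggregation)."""
    if not per_question_scores:
        raise ValueError("need at least one question score")
    return sum(per_question_scores) / len(per_question_scores)


# Hypothetical rubric for a step-by-step guide question.
rubric = [
    Criterion("The text lists the steps in the correct order."),
    Criterion("Each step has an accompanying image."),
    Criterion("The images depict the objects named in the text."),
    Criterion("The final image shows the finished result."),
]

# Stand-in judge: in practice this would be an MLLM call that inspects the
# model's images and text; here it simply looks up a fixed set of verdicts.
passed = {rubric[0].text, rubric[1].text, rubric[3].text}
question_score = score_response(rubric, lambda c: c.text in passed)
print(question_score)  # 75.0 (3 of 4 criteria met)
```

Scoring each criterion separately, rather than asking a judge for one holistic rating, is what lets a rubric-based scheme report where an output fell short.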

Code & Models

Datasets

zlab-princeton/UEval
dataset · 73 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Topic Modeling