FRABench and UFEval: Unified Fine-grained Evaluation with Task and Aspect Generalization

Shibo Hong; Jiahao Ying; Haiyuan Liang; Mengdi Zhang; Jun Kuang; Jiazheng Zhang; Yixin Cao

arXiv:2505.12795·cs.AI·September 30, 2025

TL;DR

This paper introduces UFEval, a unified evaluator for multimodal models that generalizes across tasks and aspects, supported by FRABench, a large-scale dataset with hierarchical aspect annotations.

Contribution

The paper presents the first unified fine-grained evaluator for multiple multimodal tasks, along with a comprehensive dataset, FRABench, enabling aspect and task generalization.

Findings

01

UFEval can generalize to unseen aspects through aspect-specific learning.

02

Joint training on multiple tasks improves evaluation performance.

03

FRABench provides extensive annotations for diverse multimodal evaluation.

Abstract

Evaluating open-ended outputs of Multimodal Large Language Models has become a bottleneck as model capabilities, task diversity, and modality rapidly expand. Existing “MLLM-as-a-Judge” evaluators, though promising, remain constrained to specific tasks and aspects. In this paper, we argue that, on one hand, based on the interconnected nature of aspects, learning specific aspects can generalize to unseen aspects; on the other hand, jointly learning to assess multiple visual aspects and tasks may foster a synergistic effect. To this end, we propose UFEval, the first unified fine-grained evaluator with task and aspect generalization for four evaluation tasks: Natural Language Generation, Image Understanding, Image Generation, and Interleaved Text-and-Image Generation. However, training such a unified evaluator is hindered by the lack of a large-scale, multi-modal, and aspect-level…

Peer Reviews

Decision·ICLR 2026 Oral

Reviewer 01 · Rating 8 · Confidence 4

Strengths

- They created a comprehensive, fine-grained dataset for MLLM tasks.
- The paper is well written, with detailed experiments, ablation studies, and comparisons on several public benchmarks.
- They show that FRABench can be used to fine-tune smaller 7B models, improving their performance to match larger 72B+ models on MLLM tasks.
- They show that UFEval can be used to automatically generate high-quality preference data.

Weaknesses

- Heavy reliance on GPT-4o annotations. The paper does not show how well these correlate with human labels.
- UFEval still underperforms relative to the larger models.
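The first weakness asks how closely GPT-4o annotations track human labels. One common way to quantify this kind of agreement is Cohen's kappa, which corrects raw agreement for chance. The snippet below is a minimal sketch under stated assumptions, not the paper's procedure: the function name `cohen_kappa`, the label set (`"A"`, `"B"`, `"tie"`), and the toy label lists are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: pairwise preferences on one aspect (not from FRABench).
human = ["A", "B", "A", "tie", "B", "A"]
gpt4o = ["A", "B", "B", "tie", "B", "A"]
kappa = cohen_kappa(human, gpt4o)
print(f"kappa = {kappa:.3f}")  # kappa = 0.739
```

Computed per aspect, a statistic like this would show whether agreement holds uniformly or only for some aspects.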

Reviewer 02 · Rating 6 · Confidence 3

Strengths

* FRABench is a substantial contribution: 60.4k pairwise samples and 325k aspect-level labels covering 112 distinct aspects organized in a clear hierarchical taxonomy.

Weaknesses

* Vague terminology and presentation: The paper repeatedly uses the term “aspects” in the abstract and introduction without defining it or providing concrete examples. This makes it difficult for the reader to understand what exactly is being evaluated until much later (around page 4). Introducing a brief definition or illustrative examples early on would substantially improve readability and accessibility.
* Although the dataset includes some human labels, most of the 325k aspect-level annotat

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper tackles a well-recognized bottleneck in MLLM research. As model capabilities expand, the need for scalable, reliable, and nuanced evaluation frameworks is paramount. The authors correctly identify the limitations of current approaches and propose a compelling, unified vision for evaluation.
2. The development of FRABench is a major contribution in its own right. Creating a benchmark of this scale (60.4k pairs, 325k labels) that is multi-task, multi-modal, and fine-grained is a sign

Weaknesses

1. This is the most significant concern regarding the methodology. A substantial portion of the 325k training labels are generated by GPT-4o. While the authors use human-annotated data for testing, the training data's quality is fundamental to the final model's performance. The paper's validity rests on the assumption that GPT-4o is a sufficiently reliable and unbiased annotator for 112 diverse aspects.
2. While the resulting aspect tree is a strength, the process of its creation is described so

Code & Models

Models

SPUH/UFEval
model · 3 dl · ♡ 1

Datasets

SPUH/FRABench
dataset · 66 dl

Taxonomy

Topics: Web Data Mining and Analysis · Distributed and Parallel Computing Systems