Fair in Mind, Fair in Action? A Synchronous Benchmark for Understanding and Generation in UMLLMs

Yiran Zhao; Lu Zhou; Xiaogang Xu; Zhe Liu; Jiafei Wu; Liming Fang

arXiv:2603.00590 · cs.AI · March 3, 2026

Open Access · 3 Reviews

TL;DR

The paper introduces the IRIS Benchmark, which evaluates the fairness of unified Multimodal Large Language Models (UMLLMs) on understanding and generation tasks at the same time. It addresses the fragmentation of fairness metrics and provides diagnostics for systemic biases.

Contribution

The paper presents what the authors describe as, to their knowledge, the first synchronized fairness benchmark for UMLLMs. It integrates diverse metrics into a high-dimensional fairness space and offers diagnostics for improving fairness in multimodal models.

Findings

01. Uncovered systemic biases such as the 'generation gap' and 'personality splits' in UMLLMs.

02. Demonstrated the benchmark's ability to diagnose fairness issues across models.

03. Showcased the extensibility of the IRIS framework for evolving fairness metrics.

Abstract

As artificial intelligence (AI) is increasingly deployed across domains, ensuring fairness has become a core challenge. However, the field faces a "Tower of Babel" dilemma: fairness metrics abound, yet their underlying philosophical assumptions often conflict, hindering unified paradigms, particularly in unified Multimodal Large Language Models (UMLLMs), where biases propagate systemically across tasks. To address this, we introduce the IRIS Benchmark, to our knowledge the first benchmark designed to synchronously evaluate the fairness of both understanding and generation tasks in UMLLMs. Enabled by our demographic classifier, ARES, and four supporting large-scale datasets, the benchmark is designed to normalize and aggregate arbitrary metrics into a high-dimensional "fairness space", integrating 60 granular metrics across three dimensions: Ideal Fairness, Real-world Fidelity, and Bias…
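To make the idea of normalizing and aggregating metrics into a "fairness space" concrete, here is a minimal sketch. It is not the paper's method: the min-max scaling, the per-dimension averaging, the distance-to-ideal score, the metric ranges, and every function and key name below are assumptions chosen for illustration. Only two of the three dimension names are used, because the third is cut off in the abstract above.

```python
import math


def normalize(value, lo, hi, higher_is_better=True):
    """Min-max scale a raw metric into [0, 1], where 1 is the fairest outcome.

    Assumption: each metric has a known range [lo, hi]. The paper's actual
    normalization scheme is not given in this excerpt.
    """
    if hi == lo:
        raise ValueError("metric range must be non-empty")
    scaled = (value - lo) / (hi - lo)
    scaled = min(1.0, max(0.0, scaled))  # clamp out-of-range values
    return scaled if higher_is_better else 1.0 - scaled


def fairness_point(metrics_by_dimension):
    """Average the normalized metrics in each dimension into one coordinate."""
    return {
        dim: sum(normalize(*m) for m in metrics) / len(metrics)
        for dim, metrics in metrics_by_dimension.items()
    }


def distance_to_ideal(point):
    """Euclidean distance from the all-ones corner (hypothetical ideal point)."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in point.values()))


# Hypothetical metric tuples: (value, lo, hi, higher_is_better).
example = {
    "ideal_fairness": [(0.10, 0.0, 1.0, False), (0.80, 0.0, 1.0, True)],
    "real_world_fidelity": [(0.25, 0.0, 0.5, False)],
}
point = fairness_point(example)
score = distance_to_ideal(point)
```

A smaller `score` would mean the model sits closer to the hypothetical ideal corner. Keeping one coordinate per dimension, rather than a single scalar, preserves the ability to see which dimension drives the gap.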

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The benchmark is a valuable contribution to the community and demonstrates that you can combine numerous, sometimes conflicting, fairness metrics into a single benchmark for a holistic view of the fairness of UMLLMs.
2. The benchmark analyses both understanding and generative capabilities of the models.
3. The authors effectively demonstrate the utility of the benchmark for determining characteristics of models and phenomena that might be present.
4. The authors demonstrate how you might use t

Weaknesses

1. Too much of the paper is in the Appendix. I don’t know how this can be fixed as the paper has a lot of content, but a huge proportion of the paper is in the Appendix, which was too long for me to evaluate completely. The authors should consider ways in which you can bring some important details into the main paper concisely (like the models analysed and the personality profiles). Many important assumptions are also only reported in the appendix (the simplification of gender, age, and skin

Reviewer 02 · Rating 8 · Confidence 3

Strengths

• Novel and meaningful contribution: A first benchmark to jointly assess fairness in UMLLMs. The unified fairness-space idea and the three-dimensional structure are elegant and address the “Babel Tower” problem of conflicting fairness metrics. The ARES classifier and supporting datasets make multi-dimensional fairness evaluation feasible in an automated fashion.
• Clarity: The paper is easy to follow despite its scope. Figures clearly walk the reader through the pipeline, and the IRIS-MBTI pro

Weaknesses

* Conceptual trade-off: Fairness is inherently multi-dimensional, and each metric captures a distinct philosophical or statistical notion. Collapsing these / projecting them into a high-dimensional “fairness space” might blur important nuances and make interpretability harder. How the authors balance this abstraction with human-understandable fairness judgments would be worth discussing.
* Metric sensitivity: The benchmark combines 60 metrics with tuned weights. It’s unclear how sensitive the

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The authors address an important problem (the diversity of fairness metrics and their practical applicability) and target a viable strategy (a multidimensional benchmark that integrates these metrics in practical contexts).
2. The toolkit appears well-documented and transparent.
3. The authors bring up some interesting concepts, such as personality profiles and the generation gap, that could enrich the fairness literature if developed fully.

Weaknesses

1. The authors are trying to do way too much in a single paper. The core contribution is meant to be a unification of fairness measures, but there is very little on this (e.g., why this set of measures is better than alternatives). The contributions, such as unifying different notions of fairness in a common framework, are not yet developed or defended enough to constitute a major contribution to the fairness literature.
2. The paper is too verbose with many paragraphs and sentences that are jus

Taxonomy

Topics: Ethics and Social Impacts of AI · Computational and Text Analysis Methods · Artificial Intelligence in Healthcare and Education