GenArena: How Can We Achieve Human-Aligned Evaluation for Visual Generation Tasks?

Ruihang Li; Leigang Qu; Jingxu Zhang; Dongnan Gui; Mengde Xu; Xiaosong Zhang; Han Hu; Wenjie Wang; Jiaqi Wang

arXiv:2602.06013·cs.CV·February 6, 2026

Open Access

TL;DR

GenArena introduces a pairwise comparison framework for evaluating visual generation models, significantly improving alignment with human perception and outperforming traditional pointwise scoring methods.

Contribution

The paper proposes a novel pairwise evaluation paradigm that enhances stability and human alignment, outperforming existing scoring standards in visual generation assessment.

Findings

01 Pairwise evaluation outperforms pointwise scoring in stability and human alignment.

02 With the pairwise protocol, off-the-shelf open-source models can outperform top-tier proprietary models.

03 Evaluation accuracy improves by over 20%, with a Spearman correlation of 0.86.
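As a reminder of what a Spearman correlation measures, the sketch below computes it from two score lists by ranking each list and taking the Pearson correlation of the ranks. The function names and the sample scores are illustrative assumptions, not data or code from the paper.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 (smallest) to n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the two rank vectors.

    Assumes both lists have the same length and are not constant
    (a constant list has zero rank variance, so rho is undefined).
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five models: an automated judge vs. a human reference.
judge_scores = [0.9, 0.7, 0.8, 0.2, 0.4]
human_scores = [0.95, 0.7, 0.6, 0.1, 0.5]
rho = spearman(judge_scores, human_scores)
```

Only the ordering of the models matters here, not the score magnitudes, which is why the metric suits comparing a judge's ranking against a reference ranking.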

Abstract

The rapid advancement of visual generation models has outpaced traditional evaluation approaches, necessitating the adoption of Vision-Language Models as surrogate judges. In this work, we systematically investigate the reliability of the prevailing absolute pointwise scoring standard across a wide spectrum of visual generation tasks. Our analysis reveals that this paradigm is limited due to stochastic inconsistency and poor alignment with human perception. To resolve these limitations, we introduce GenArena, a unified evaluation framework that leverages a pairwise comparison paradigm to ensure stable and human-aligned evaluation. Crucially, our experiments uncover a transformative finding that simply adopting this pairwise protocol enables off-the-shelf open-source models to outperform top-tier proprietary models. Notably, our method boosts evaluation accuracy by over 20% and achieves…
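To make the contrast with pointwise scoring concrete, here is a minimal sketch of how pairwise judgments could be turned into a model ranking. The abstract does not say how GenArena aggregates comparisons, so the win-rate aggregation, the two-order querying, the `prefer` callback, and the model names below are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def rank_by_pairwise_wins(outputs, prefer):
    """Rank models by win rate over all pairwise comparisons.

    outputs: dict mapping model name -> its output for one prompt.
    prefer:  hypothetical judge callback; prefer(a, b) returns "a", "b", or "tie".
    Requires at least two models.
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for m1, m2 in combinations(outputs, 2):
        # Assumption: query both presentation orders so that any
        # position preference of the judge tends to cancel out.
        for first, second in ((m1, m2), (m2, m1)):
            verdict = prefer(outputs[first], outputs[second])
            if verdict == "a":
                wins[first] += 1.0
            elif verdict == "b":
                wins[second] += 1.0
            else:  # a tie counts as half a win for each side
                wins[first] += 0.5
                wins[second] += 0.5
            games[first] += 1
            games[second] += 1
    win_rate = {m: wins[m] / games[m] for m in outputs}
    ranking = sorted(outputs, key=lambda m: win_rate[m], reverse=True)
    return ranking, win_rate


def toy_prefer(a, b):
    """Stand-in judge: each 'output' is just a hidden quality number."""
    if a > b:
        return "a"
    if b > a:
        return "b"
    return "tie"


# Hypothetical models and outputs, for illustration only.
toy_outputs = {"model_x": 0.8, "model_y": 0.5, "model_z": 0.65}
ranking, win_rate = rank_by_pairwise_wins(toy_outputs, toy_prefer)
```

In a real setup `prefer` would wrap a Vision-Language Model call that sees both images and returns a preference; the judge only has to decide which of two outputs is better, rather than place each output on an absolute scale.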

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Data Visualization and Analytics