K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences

Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong

arXiv:2408.14468 · cs.AI · March 18, 2025

TL;DR

K-Sort Arena introduces a K-wise comparison platform for evaluating generative models. By exploiting the perceptual intuitiveness of images and videos and by using probabilistic modeling, it substantially improves efficiency and robustness over traditional pairwise Arena methods.

Contribution

It presents a novel K-wise comparison framework with Bayesian updating and exploration strategies, enabling faster and more reliable benchmarking of generative models.
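
The Bayesian updating and exploration ideas can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the paper's method: it assumes a TrueSkill-style Gaussian skill model (each model's capability is N(mu, sigma^2) with TrueSkill's default constants 25, 25/3 and beta = 25/6), decomposes a K-wise ranking into pairwise wins that are applied one at a time, and picks the next K models with a hypothetical upper-confidence-bound rule (mean plus `c` times standard deviation). The names `Skill`, `update_pair`, `update_from_ranking` and `pick_k_ucb` are illustrative only.

```python
import math
from dataclasses import dataclass

# Assumed constants (TrueSkill defaults), not values from the K-Sort Arena paper.
BETA = 25.0 / 6.0  # per-vote performance noise


@dataclass
class Skill:
    """Gaussian belief about one model's capability (hypothetical helper)."""
    mu: float = 25.0
    sigma: float = 25.0 / 3.0


def _pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def update_pair(winner: Skill, loser: Skill, beta: float = BETA) -> None:
    """TrueSkill-style moment-matching update for one 'winner beats loser' outcome (no draws)."""
    ws2, ls2 = winner.sigma ** 2, loser.sigma ** 2
    c2 = 2.0 * beta ** 2 + ws2 + ls2
    c = math.sqrt(c2)
    t = (winner.mu - loser.mu) / c
    v = _pdf(t) / _cdf(t)  # how far to shift the means (large for upsets)
    w = v * (v + t)        # how much to shrink the variances, in (0, 1)
    winner.mu += ws2 / c * v
    loser.mu -= ls2 / c * v
    winner.sigma = math.sqrt(ws2 * (1.0 - ws2 / c2 * w))
    loser.sigma = math.sqrt(ls2 * (1.0 - ls2 / c2 * w))


def update_from_ranking(skills: dict, ranking: list) -> None:
    """Apply every pairwise win implied by a best-first ranking (sequential approximation)."""
    for i in range(len(ranking)):
        for j in range(i + 1, len(ranking)):
            update_pair(skills[ranking[i]], skills[ranking[j]])


def pick_k_ucb(skills: dict, k: int, c: float = 1.0) -> list:
    """Choose k models with the highest mu + c * sigma, favoring uncertain models as c grows."""
    return sorted(skills, key=lambda m: skills[m].mu + c * skills[m].sigma, reverse=True)[:k]
```

For example, after `update_from_ranking(skills, ["B", "D", "A", "C"])` on four fresh models, the means are ordered B > D > A > C and every sigma has shrunk. A newly added model keeps a large sigma, so a larger `c` makes `pick_k_ucb` more likely to schedule it, which is the exploration side of the trade-off.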

Findings

1. 16.3x faster convergence than the ELO algorithm
2. Effective incorporation of human feedback via crowdsourcing
3. Continuous leaderboard updates with minimal votes
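
For reference, the ELO baseline in the first finding rates models from one pairwise vote at a time. The sketch below uses the textbook Elo formula; the K-factor of 32 and the 400-point scale are conventional defaults assumed here, not values reported by the paper, and the function names are illustrative.

```python
def elo_expected(r_a: float, r_b: float, scale: float = 400.0) -> float:
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))


def elo_update(r_a: float, r_b: float, score_a: float, k_factor: float = 32.0) -> tuple[float, float]:
    """Update both ratings after one pairwise vote.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if B wins.
    """
    e_a = elo_expected(r_a, r_b)
    delta = k_factor * (score_a - e_a)
    # Zero-sum: B loses exactly what A gains.
    return r_a + delta, r_b - delta
```

Two equally rated models at 1000 move to 1016 and 984 after one decisive vote. Each call touches only two ratings, so a single pairwise vote informs exactly one model pair.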

Abstract

The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. The Arena platform, which gathers user votes on model comparisons, can rank models by human preference. However, traditional Arena methods, while established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than text, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the…
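
Why a K-wise vote carries more information than a pairwise one can be illustrated with a simple count, assuming the voter fully orders all K models: one such ranking implies K(K-1)/2 pairwise outcomes. This counting view is only an illustration, not the paper's probabilistic model, and the helper names are hypothetical.

```python
from itertools import combinations


def implied_pairs(ranking: list[str]) -> list[tuple[str, str]]:
    """(winner, loser) pairs implied by a best-first ranking of K models."""
    # combinations() preserves input order, so the better-ranked model comes first.
    return list(combinations(ranking, 2))


def pairs_per_vote(k: int) -> int:
    """Number of pairwise outcomes implied by one full ranking of k models."""
    return k * (k - 1) // 2
```

With K = 4, `implied_pairs(["B", "D", "A", "C"])` yields 6 ordered pairs from a single vote, whereas a pairwise Arena vote yields exactly 1.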

Taxonomy

Topics: Human Pose and Action Recognition · Machine Learning and Data Classification · Video Analysis and Summarization