K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
Zhikai Li, Xuewen Liu, Dongrong Joe Fu, Jianquan Li, Qingyi Gu, Kurt Keutzer, Zhen Dong

TL;DR
K-Sort Arena is a K-wise comparison platform for evaluating visual generative models. It improves efficiency and robustness over traditional pairwise Arena methods by leveraging the perceptual intuitiveness of images and videos together with probabilistic modeling.
Contribution
It presents a novel K-wise comparison framework with Bayesian updating and exploration strategies, enabling faster and more reliable benchmarking of generative models.
Findings
16.3x faster convergence than the ELO algorithm
Effective incorporation of human feedback via crowdsourcing
Continuous leaderboard updates with minimal votes
Abstract
The rapid advancement of visual generative models necessitates efficient and reliable evaluation methods. Arena platforms, which gather user votes on model comparisons, can rank models by human preference. However, traditional Arena methods, while established, require an excessive number of comparisons for the ranking to converge and are vulnerable to preference noise in voting, suggesting the need for better approaches tailored to contemporary evaluation challenges. In this paper, we introduce K-Sort Arena, an efficient and reliable platform based on a key insight: images and videos possess higher perceptual intuitiveness than text, enabling rapid evaluation of multiple samples simultaneously. Consequently, K-Sort Arena employs K-wise comparisons, allowing K models to engage in free-for-all competitions, which yield much richer information than pairwise comparisons. To enhance the…
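To make the abstract's claim that K-wise comparisons carry more information than pairwise ones concrete, the sketch below expands a single K-wise vote into the pairwise outcomes it implies. It is a minimal illustration, not the paper's method: it assumes a vote is a strict best-to-worst ranking with no ties, and the function name `pairwise_outcomes` and the model names are hypothetical. The paper's probabilistic modeling and Bayesian updating are not reproduced here.

```python
from itertools import combinations


def pairwise_outcomes(ranking):
    """Expand one K-wise vote into the (winner, loser) pairs it implies.

    `ranking` lists model names from best to worst (assumed: no ties).
    A ranking of K models implies K * (K - 1) / 2 pairwise outcomes.
    """
    return [
        (ranking[i], ranking[j])
        for i, j in combinations(range(len(ranking)), 2)
    ]


# Hypothetical vote over K = 4 models, ordered best to worst.
vote = ["model_a", "model_c", "model_b", "model_d"]
pairs = pairwise_outcomes(vote)
print(len(pairs))  # 6 pairwise outcomes from one vote
```

Under these assumptions, one vote over K = 4 models yields six pairwise outcomes, whereas a single pairwise vote yields one. This is one simple way to see why fewer votes may be needed for a ranking to converge.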
Taxonomy
Topics: Human Pose and Action Recognition · Machine Learning and Data Classification · Video Analysis and Summarization
