K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge

Zhikai Li; Jiatong Li; Xuewen Liu; Wangbo Zhao; Pan Du; Kaicheng Zhou; Qingyi Gu; Yang You; Zhen Dong; Kurt Keutzer

arXiv:2602.09411·cs.CV·February 11, 2026

Open Access · 3 Reviews

TL;DR

K-Sort Eval introduces a scalable, reliable, and efficient VLM-based framework for evaluating visual generative models by integrating posterior correction and dynamic matching, reducing the need for extensive human or model comparisons.

Contribution

It presents a novel evaluation framework that combines posterior correction and dynamic matching to improve alignment and efficiency in VLM-based model evaluation.

Findings

01

Achieves evaluation results consistent with human-based assessments

02

Requires fewer than 90 model comparisons on average

03

Demonstrates high reliability and efficiency in model evaluation

Abstract

The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While the crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the…
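
To make the idea of correcting a noisy judge more concrete, here is a minimal sketch of a generic conjugate Gaussian update in which the observation variance is inflated to account for judge noise. This is not the paper's derivation: the function names (`gaussian_update`, `corrected_update`), the scalar capability score, and the parameters `base_var` and `judge_noise_var` are all hypothetical, chosen only to illustrate how treating VLM-human misalignment as extra observation noise shrinks each update.

```python
from typing import Tuple


def gaussian_update(mu: float, var: float, obs: float, obs_var: float) -> Tuple[float, float]:
    """Standard conjugate update of a Gaussian belief N(mu, var)
    given one observation `obs` with noise variance `obs_var`."""
    gain = var / (var + obs_var)      # how much to trust the observation
    new_mu = mu + gain * (obs - mu)   # shift the mean toward the observation
    new_var = (1.0 - gain) * var      # uncertainty shrinks after every update
    return new_mu, new_var


def corrected_update(mu: float, var: float, obs: float,
                     base_var: float, judge_noise_var: float) -> Tuple[float, float]:
    """Hypothetical 'corrected' update: the judge's extra noise is added to
    the observation variance (assumes the noise is independent and zero-mean)."""
    return gaussian_update(mu, var, obs, base_var + judge_noise_var)


# Same prior and observation, with and without extra judge noise.
plain = gaussian_update(0.0, 1.0, 1.0, 1.0)
corrected = corrected_update(0.0, 1.0, 1.0, base_var=1.0, judge_noise_var=2.0)
print(plain, corrected)
```

Under these assumptions, a noisier judge moves the capability estimate less and leaves more residual uncertainty, which is the qualitative behavior one would expect from down-weighting unreliable votes.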

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper addresses a real scalability challenge with Arena-style human evaluations, which are costly and time-consuming. The proposed VLM-as-judge approach is timely and practical. 2. The posterior correction method (Section 3.2) is theoretically grounded in Bayesian inference, with formal derivations provided. The treatment of VLM-human misalignment as observation noise is elegant and well-justified through Lemma 1. 3. The paper demonstrates consistency with K-Sort Arena across multiple mod

Weaknesses

1. K-Sort Eval still relies on existing K-Sort Arena data as supervision, which limits where it can be applied and means it cannot produce a score independently. Therefore, the quality of the evaluation also depends on the quality of the K-Sort Arena data. 2. The evaluation setting should be clearer about how the evaluation task is defined. K-Sort Eval selects a model already in K-Sort Arena and then predicts its rank using existing information from K-Sort Arena, so there might be information leaking durin

Reviewer 02 · Rating 8 · Confidence 5

Strengths

S1) This paper tackles the pain point that it is difficult to evaluate visual generative models both reliably and efficiently. Human annotations are highly accurate but expensive and infeasible to scale for every model update or new dataset. Meanwhile, fast automatic scoring using VLMs is scalable but suffers from biases. K-Sort Eval aims to deliver a solution that preserves human-grade reliability while radically improving cost-efficiency and scalability by statistically correcting VLM judgmen

Weaknesses

W1) When this is applied in cases where there are not enough votes, the rankings might not fully reflect the models' performance. Will such behavior propagate to new models? Also, would it be possible to estimate how much collected human- or VLM-annotated data is enough to ensure that K-Sort Eval works effectively?

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Reliable VLM Correction: The core strength is the posterior correction method. By modeling the VLM-human preference gap as observation noise and correcting the Bayesian update, the framework significantly improves alignment with human judgments. 2. Strong Empirical Validation: The method demonstrates high consistency with the human-voted K-Sort Arena leaderboard on both image and video tasks. Its utility is also shown in practical use cases, like evaluating compressed models.

Weaknesses

1. Simplified Noise Model: Assumption 1—that VLM noise is statistically independent of the true model capability—is a strong simplification. It's plausible that VLM biases are systematic (e.g., favoring certain aesthetics or failing to spot specific artifacts), which this noise model would not capture. 2. Incomplete Literature Review: The "Large Model as a Judge" related work section overlooks several recent and relevant works exploring VLMs as human-aligned evaluators. For instance, it misses a

Taxonomy

Topics: Multimodal Machine Learning Applications · Mobile Crowdsensing and Crowdsourcing · Generative Adversarial Networks and Image Synthesis