EZ-Sort: Efficient Pairwise Comparison via Zero-Shot CLIP-Based Pre-Ordering and Human-in-the-Loop Sorting
Yujin Park, Haejun Chung, Ikbeom Jang

TL;DR
EZ-Sort combines zero-shot CLIP-based pre-ordering with human-in-the-loop sorting to significantly reduce annotation costs in pairwise comparison tasks, maintaining high reliability.
Contribution
It introduces a novel method that leverages CLIP for pre-ordering and automates easy comparisons, reducing human effort in pairwise ranking tasks.
Findings
Reduced human annotation cost by 90.5% compared to exhaustive methods
Achieved 19.8% cost reduction over prior work for n=100
Maintained or improved inter-rater reliability
Abstract
Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability. However, exhaustive comparisons require a massive number of annotations (O(n^2)). Recent work has greatly reduced the annotation burden (O(n log n)) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by (1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training, and (2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET), historical…
Table 1. Inter-rater reliability of each annotation method (± values in parentheses; not available for ICC).
| | Retina (EyePACS) | | | | Historical (DHCI) | | | | Face (FGNET) | | | |
| Method | Sp | Ke | Pe | ICC | Sp | Ke | Pe | ICC | Sp | Ke | Pe | ICC |
| Classification | 0.53 (±0.06) | 0.46 (±0.07) | 0.54 (±0.07) | 0.75 | 0.39 (±0.06) | 0.33 (±0.05) | 0.42 (±0.06) | 0.68 | 0.92 (±0.04) | 0.85 (±0.09) | 0.94 (±0.03) | 0.97 |
| Sort comparison (jang2022decreasing, ) | 0.72 (±0.07) | 0.56 (±0.06) | 0.72 (±0.07) | 0.89 | 0.47 (±0.17) | 0.35 (±0.15) | 0.47 (±0.17) | 0.78 | 0.97 (±0.01) | 0.88 (±0.01) | 0.97 (±0.01) | 0.99 |
| EZ-Sort (CIKM) | 0.85 (±0.09) | 0.76 (±0.14) | 0.85 (±0.09) | 0.94 | 0.47 (±0.17) | 0.39 (±0.15) | 0.47 (±0.16) | 0.73 | 0.96 (±0.01) | 0.91 (±0.02) | 0.96 (±0.01) | 0.99 |
Table 2. Number of human comparisons required by each method.
| Dataset size (n) | Exhaustive comparison (thurstone1927method, ) | Sort comparison (jang2022decreasing, ) | EZ-Sort (CIKM) |
| 30 | 435 | 126 | 89 |
| 50 | 1,225 | 240 | 142 |
| 100 | 4,950 | 582 | 467 |
| Dataset | Size | Task | Labels |
| FGNET | 1,002 | Face Age | 0–69 years (continuous) |
| DHCI | 450 | Historical Dating | 1930s–1970s (5 classes) |
| EyePACS | 28,792 | Retinal Quality | 3-level grading |
Yujin Park (https://orcid.org/0009-0001-8988-5698), Hanyang University, Seoul 04763, Republic of Korea
Haejun Chung* (https://orcid.org/0000-0001-8959-237X), Hanyang University, Seoul 04763, Republic of Korea
Ikbeom Jang* (https://orcid.org/0000-0002-6901-983X), Hankuk University of Foreign Studies, Yongin 17035, Republic of Korea
(2025)
Abstract.
Pairwise comparison is often favored over absolute rating or ordinal classification in subjective or difficult annotation tasks due to its improved reliability; however, exhaustive comparisons require a massive number of annotations ($O(n^2)$). Recent work (jang2022decreasing, ) greatly reduced the annotation burden ($O(n \log n)$) by actively sampling pairwise comparisons using a sorting algorithm. We further improve annotation efficiency by 1) roughly pre-ordering items using the Contrastive Language-Image Pre-training (CLIP) model hierarchically without training and 2) replacing easy, obvious human comparisons with automated comparisons. The proposed EZ-Sort first produces a CLIP-based zero-shot pre-ordering, then initializes bucket-aware Elo scores, and finally runs an uncertainty-guided human-in-the-loop MergeSort. Validation was conducted using various datasets: face-age estimation (FGNET) (993553, ), historical image chronology (DHCI) (palermo2012dating, ), and retinal image quality assessment (EyePACS) (fu2019evaluation, ). It showed that EZ-Sort reduced human annotation cost by 90.5% compared to exhaustive pairwise comparisons and by 19.8% compared to prior work (jang2022decreasing, ) (when $n = 100$) while improving or maintaining inter-rater reliability. These results demonstrate that combining CLIP-based priors with uncertainty-aware sampling yields an efficient and scalable solution for pairwise ranking. Code available at https://github.com/yujinPark02/EZ-Sort-CIKM2025.
Pairwise comparison, Human-in-the-loop sorting, VLM-based pre-ordering, Annotation, Labeling
*Corresponding authors.
Published in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. DOI: 10.1145/3746252.3760848. ISBN: 979-8-4007-2040-6/2025/11. License: CC BY-NC-SA. CCS concepts: Information systems → Data labeling; Computing methodologies → Ranking.
1. Introduction
Pairwise comparison is widely preferred in subjective annotation tasks, including perceptual quality assessment, face-age estimation, and medical image triage (kalpathy2016plus, ), due to its superior inter-rater reliability compared to absolute or ordinal ratings. However, exhaustive pairwise labeling incurs a quadratic annotation burden ($O(n^2)$), quickly becoming infeasible as dataset sizes grow. This annotation bottleneck has tangible implications for large-scale clinical diagnostics, population health studies, and the preservation of historical records, where subject-matter expertise is scarce and annotation budgets are limited. This scalability challenge can be addressed by leveraging computational sorting algorithms, such as MergeSort.
However, traditional methods, such as the Bradley–Terry–Luce model (bradley1952rank, ) and active learning strategies (maystre2017just, ; saar2004active, ), typically assume uniform priors and overlook opportunities to leverage the existing semantic structure in the data. More recently, sorting-based approaches have incorporated active query selection to reduce the number of comparisons (jang2022decreasing, ).
To further alleviate the annotation burden, we propose incorporating vision-language models (VLM), such as CLIP (radford2021learning, ), to provide a strong starting point for the sorting process. Because such models are pre-trained on hundreds of millions of image-text pairs, proper prompts enable a coarse initial ranking of given items to be labeled. This leads to a significant reduction in the number of comparisons needed for sorting. We iteratively execute this process hierarchically to improve accuracy. We also propose replacing human comparisons with automated comparisons for item pairs with low uncertainty. By combining this prior knowledge with an uncertainty-guided comparison selection strategy, we aim to provide a complementary perspective to existing methods, focusing on the efficient utilization of both model priors and human expertise.
Our key contributions include (1) leveraging pre-trained vision-language priors to reduce the initial annotation search space significantly, (2) applying this VLM-based pre-ordering hierarchically for improved accuracy, and (3) introducing a novel uncertainty-guided sorting strategy to prioritize human annotation resources intelligently.
2. Methods
EZ-Sort follows a three-stage pipeline: (i) hierarchical CLIP prompting provides a zero-shot semantic pre-ordering (i.e., ordinal classification) of images, (ii) bucket-aware Elo scores initialize priors, and (iii) a Kullback-Leibler (KL)-based MergeSort routes only uncertain pairs to annotators (budagam2024hierarchical, ; booch2021thinking, ).
This pipeline reflects a dual-process structure, with automatic, low-uncertainty comparisons (System 1) and human deliberation for high-uncertainty cases (System 2). The subsequent subsections provide a detailed overview of each stage.
2.1. Hierarchical Prompt Design for Zero-Shot Classification
Traditional classification often struggles with ambiguous subjective attributes. To address this, we introduce a hierarchical prompting strategy that exploits CLIP’s zero-shot capabilities through multi-level binary decisions. Inspired by how machine learning has optimized sorting algorithms (mankowitz2023faster, ; zhao2018n, ; bai2023sorting, ), our method enhances classification robustness by decomposing complex decisions into more tractable binary ones.
To evaluate its benefit over single-level prompting, we tested two flat baselines: a minimal template and an enhanced variant incorporating GPT-4-generated attributes. Experiments across three datasets demonstrate that hierarchical prompts outperform both baselines, reducing mean absolute error (MAE) by up to 2.0. This performance gain arises from (1) decision decomposition, which replaces multi-way classification with binary steps where CLIP excels, and (2) progressive refinement, which focuses each decision on discriminative features at that granularity. By mimicking human coarse-to-fine reasoning, our hierarchical strategy enhances both interpretability and classification accuracy, particularly in visually ambiguous domains.
2.1.1. Prompt Generation Methodology
We replace flat multi-class classification with an adaptive hierarchical binary structure that keeps splitting groups until no meaningful visual distinctions remain. At each level, we define the text prompts for the corresponding binary decision, building upon recent advances in prompt learning (zhou2022learning, ; zhou2022conditional, ).
**Automated Hierarchical Prompt Generation:**
“Given the domain [domain] with range [range], iteratively divide each group into two visually distinguishable sub-groups using observable anatomical or textural features. Continue dividing until no further meaningful visual distinctions are identifiable. Avoid behavioural or contextual clues.”
For face-age estimation (domain = face, range = 0–60+ years) this produces prompts such as “rounded cheeks, large forehead” (infants) versus “defined cheekbones, mature jawline” (adults); the recursion stops once visual distinctions become ambiguous.
Adaptive Depth Criterion. Depth is not fixed; generation halts when GPT-4 signals that further splits are unreliable, resulting in 3 to 5 levels in practice, balancing granularity with annotation load.
Group Assignment. The binary outcomes along each image's path through the hierarchy are combined into a final group index; the length of that path is the image-specific depth. This hierarchical encoding stabilizes Elo initialization and reduces the need for downstream human comparisons.
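As a minimal sketch of the group-assignment step, the binary outcomes can be read as a binary number, coarsest decision first. The function name `group_index`, the bit convention (0 for the lower-ranked branch, 1 for the higher-ranked one), and the left-alignment of shorter paths are illustrative assumptions, not the paper's exact formula.

```python
def group_index(bits, max_depth):
    """Map a path of binary decisions to an ordinal group index.

    bits: list of 0/1 outcomes, coarsest decision first (assumed encoding).
    max_depth: deepest level of the hierarchy; shorter paths are left-aligned
               so that indices from different depths stay comparable
               (an illustrative choice).
    """
    if len(bits) > max_depth:
        raise ValueError("path is longer than the hierarchy depth")
    index = 0
    for b in bits:
        if b not in (0, 1):
            raise ValueError("decisions must be 0 or 1")
        index = (index << 1) | b
    return index << (max_depth - len(bits))


# Example with a three-level hierarchy.
print(group_index([0, 1, 1], max_depth=3))  # 3
print(group_index([1, 0], max_depth=3))     # 4
```

Under this encoding, an image that stops early (depth 2 in the second call) lands at the start of the index range its path covers, so the ordinal relationship between groups is preserved.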
2.1.2. CLIP-Based Classification
For each level in the hierarchy, we compute text and image embeddings using CLIP (radford2021learning, ), and measure the similarity between an image $x$ and a prompt $p$ as the cosine similarity of their embeddings: $s(x, p) = \frac{f_I(x)^\top f_T(p)}{\lVert f_I(x) \rVert \, \lVert f_T(p) \rVert}$, where $f_I$ and $f_T$ denote the CLIP image and text encoders. This approach leverages CLIP’s zero-shot classification capabilities (qian2024online, ) and builds upon knowledge-enhanced visual models (shen2022k, ).
Classification decisions and confidence scores are derived from these similarities: the decision selects the prompt with the highest similarity, and the confidence comes from a softmax over the similarities, scaled by a temperature parameter that controls the softness of the distribution. Figure 2 shows an example of this process at Level 1 using binary prompts for age grouping. Having established the pre-ordering and initial bucket assignments, we next describe how pairs are selected for human annotation based on uncertainty.
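The decision and confidence computation at one level of the hierarchy can be sketched as below. This is a hedged illustration: plain Python lists stand in for CLIP embeddings, `binary_decision` is a hypothetical helper name, and the temperature of 0.1 is a placeholder, not the paper's setting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def binary_decision(image_emb, prompt_embs, temperature=0.1):
    """Pick the best-matching prompt and return (choice, confidence).

    image_emb and prompt_embs stand in for CLIP embeddings; the temperature
    value is an assumed placeholder.
    """
    sims = [cosine(image_emb, p) for p in prompt_embs]
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    probs = [e / total for e in exps]
    choice = max(range(len(probs)), key=probs.__getitem__)
    return choice, probs[choice]


# Toy 2-D "embeddings": the image is closer to the second prompt.
choice, conf = binary_decision([0.6, 0.8], [[1.0, 0.0], [0.0, 1.0]])
print(choice, round(conf, 3))  # 1 0.881
```

A lower temperature sharpens the softmax, so the same similarity gap yields a higher confidence; a higher temperature pushes confidences toward 0.5.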
2.1.3. Bucket-Aware Rating Initialization
To obtain a coarse ordering for Elo initialization, the fine-grained groups produced by hierarchical classification are merged into primary buckets. The merging rule distributes groups uniformly across buckets while preserving their ordinal relationships. We empirically find the optimal number of primary buckets to be between 3 and 5, which balances ranking accuracy and annotation cost; too few buckets merge overly dissimilar items, while too many increase uncertain cross-bucket comparisons without accuracy benefits.
Accordingly, we set the number of buckets separately for CLIP-friendly domains such as FGNET and for more challenging domains like DHCI and EyePACS, thereby containing noise and comparison overhead. Each image then receives an initial Elo rating built from three parts: a base rating determined by its bucket, a noise term that adds controlled randomness, and a confidence term that keeps high-confidence samples stable while allowing low-confidence ones to move more freely.
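One plausible form of this initialization is sketched below. It assumes base ratings spaced linearly over [1200, 1800] (the range given in the implementation details) and Gaussian noise damped by `1 - confidence`. The function name `init_elo`, the damping form, and `noise_std=25.0` are illustrative assumptions rather than the paper's exact equation or values.

```python
import random


def init_elo(bucket, n_buckets, confidence, noise_std=25.0,
             low=1200.0, high=1800.0, rng=random):
    """Initial rating: bucket base plus confidence-damped noise (assumed form).

    bucket: 0-based bucket index, ordered from lowest to highest rank.
    confidence: CLIP confidence in [0, 1]; 1.0 removes the noise entirely.
    noise_std: placeholder value, not taken from the paper.
    """
    if n_buckets < 2:
        base = (low + high) / 2.0
    else:
        base = low + (high - low) * bucket / (n_buckets - 1)
    noise = rng.gauss(0.0, noise_std)
    return base + (1.0 - confidence) * noise


rng = random.Random(0)  # seeded for repeatability
for b in range(3):
    print(b, round(init_elo(b, 3, confidence=0.9, rng=rng), 1))
```

With three buckets the base ratings are 1200, 1500, and 1800; a sample with confidence 0.9 receives only a tenth of the drawn noise, so it stays close to its bucket base.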
2.1.4. Information-Gain-Based Uncertainty and Comparison Prioritization
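We assess the informativeness of comparisons using the KL-divergence from a uniform prior. For items $i$ and $j$ with Elo scores $r_i$ and $r_j$, the pre-comparison distribution is $P_{ij} = (p_{ij},\, 1 - p_{ij})$, where $p_{ij} = 1 / \bigl(1 + 10^{(r_j - r_i)/400}\bigr)$, following the default Elo setting.
The information gain is computed as:
$$\mathrm{IG}(i, j) = D_{\mathrm{KL}}\bigl(P_{ij} \,\|\, \mathrm{Unif}\bigr) = p_{ij} \log (2 p_{ij}) + (1 - p_{ij}) \log \bigl(2 (1 - p_{ij})\bigr),$$
where $\mathrm{Unif} = (1/2,\, 1/2)$ is the uniform prior.
To prioritize comparisons, we define a priority score that includes a bucket-dependent weight, set differently for cross-bucket pairs (to account for CLIP uncertainty) than for same-bucket pairs, and a term that penalizes low-confidence predictions. The uncertainty measure is the complement of the information gain, normalized by the maximum binary InfoGain ($\log 2$): $U(i, j) = 1 - \mathrm{IG}(i, j) / \log 2$.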
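The information gain and uncertainty in this subsection can be sketched as follows, assuming the standard Elo expected-score formula with scale 400 and natural logarithms. The function names are illustrative, and the priority weighting is omitted because its coefficients are not given here.

```python
import math


def elo_win_prob(r_i, r_j, scale=400.0):
    """Standard Elo expected score of item i against item j."""
    return 1.0 / (1.0 + 10.0 ** ((r_j - r_i) / scale))


def info_gain(p):
    """KL divergence (in nats) of Bernoulli(p) from the uniform prior."""
    total = 0.0
    for q in (p, 1.0 - p):
        if q > 0.0:  # the term 0 * log(0) is taken as 0
            total += q * math.log(2.0 * q)
    return total


def uncertainty(r_i, r_j):
    """Complement of the information gain, normalised by its maximum, log 2."""
    return 1.0 - info_gain(elo_win_prob(r_i, r_j)) / math.log(2.0)


print(round(uncertainty(1500, 1500), 3))  # 1.0 (equal ratings)
print(round(uncertainty(1200, 1800), 3))  # far-apart ratings give a low value
```

Equal ratings give a win probability of 0.5, zero information gain, and therefore maximal uncertainty; widely separated ratings push the uncertainty toward 0, which is what lets such pairs be resolved automatically.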
2.1.5. MergeSort with Uncertainty-Aware Comparison Selection
Our algorithm follows the exact comparison schedule of classical MergeSort; the only difference is how each comparison is resolved. Let $U(i, j)$ be the KL-based uncertainty score for items $i$ and $j$, and $\tau$ an adaptive threshold (Sec. 2.1.5). During each merge, a human annotator is queried when $U(i, j) \ge \tau$, breaking ties in favor of the human query. If the condition is false, the outcome is decided automatically by the sign of $r_i - r_j$, where $r$ denotes the current Elo scores.
Because (i) every pair that standard MergeSort examines is still compared, and (ii) deciding this rule takes constant time ($O(1)$), the overall complexity remains $O(n \log n)$. Thus, EZ-Sort preserves the algorithmic optimality while redirecting human effort only to the most uncertain comparisons.
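A minimal sketch of this merge procedure follows, with the threshold held fixed for simplicity (the paper adapts it over time). The function `ez_merge_sort`, its callback parameters, and the toy data are illustrative assumptions, not the released implementation.

```python
def ez_merge_sort(items, rating, uncertainty, ask_human, threshold):
    """Merge sort whose comparisons are answered by a human or by ratings.

    rating: dict mapping item -> current Elo score (assumed given).
    uncertainty: function (a, b) -> score in [0, 1].
    ask_human: function (a, b) -> True if a should come before b.
    threshold: fixed here; an adaptive schedule could be swapped in.
    Returns (sorted_items, number_of_human_queries).
    """
    queries = 0

    def before(a, b):
        nonlocal queries
        if uncertainty(a, b) >= threshold:  # ties go to the human
            queries += 1
            return ask_human(a, b)
        return rating[a] <= rating[b]       # automatic decision

    def sort(seq):
        if len(seq) <= 1:
            return list(seq)
        mid = len(seq) // 2
        left, right = sort(seq[:mid]), sort(seq[mid:])
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if before(left[i], right[j]):
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        return merged + left[i:] + right[j:]

    ordered = sort(list(items))
    return ordered, queries


# Toy run: ratings misorder "b" and "c", and only that pair is uncertain.
truth = {"a": 1, "b": 2, "c": 3, "d": 4}
rating = {"a": 1200, "b": 1510, "c": 1500, "d": 1800}


def toy_uncertainty(x, y):
    return 1.0 if abs(rating[x] - rating[y]) < 50 else 0.0


def toy_human(x, y):
    return truth[x] < truth[y]


order, n_queries = ez_merge_sort(["d", "b", "c", "a"], rating,
                                 toy_uncertainty, toy_human, threshold=0.5)
print(order, n_queries)  # ['a', 'b', 'c', 'd'] 1
```

In the toy run, four of the five comparisons are settled from the ratings, and the single human query corrects the one pair the ratings had in the wrong order.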
Adaptive threshold
The threshold is adapted at each evaluation cycle based on the remaining budget and current accuracy. The cycle counter is 0-based and is updated after every batch of human comparisons or every round of merge operations, whichever occurs first. One coefficient controls budget sensitivity, and another encourages increased automation as accuracy improves.
3. Experiments and Results
For proof of concept, we evaluated EZ-Sort on three public datasets: FGNET for face age estimation, DHCI for historical image chronology, and EyePACS for retinal image quality. We conducted two types of experiments: (1) inter-rater reliability with expert annotations and (2) annotation efficiency benchmarking across dataset sizes that are common in expert-only scenarios (up to $n = 100$). The reported improvements are statistically significant.
Inter-rater reliability. We randomly selected 30 images per dataset and had three domain experts annotate them using absolute classification, sort comparison (jang2022decreasing, ), and EZ-Sort. Results in Table 1 show the inter-rater consistency achieved by each method. Face age estimation (FGNET) achieved uniformly high reliability across all approaches (ICC ≥ 0.97), indicating clear visual criteria. DHCI exhibited moderate consistency (ICC 0.68–0.78), with pairwise methods outperforming classification. For retinal image quality (EyePACS), EZ-Sort attained the highest reliability (ICC = 0.94, Spearman = 0.85), demonstrating robustness in ambiguous medical images.
Ablation study. To validate whether hierarchical prompting provides benefit, we analyzed the correlation between the sorted output and the ground truth continuous label, age, which is only available in FGNET. Spearman correlation was 0.90 with EZ-Sort, while flat prompts (with seven class-specific prompts) showed a correlation of 0.83, indicating that hierarchical prompting yields an improvement of 8.4% through progressive refinement.
Annotation efficiency. Table 2 shows that EZ-Sort required only 20.5%, 11.6%, and 9.4% of exhaustive comparisons at $n = 30$, $50$, and $100$, respectively. Compared to sort comparison (jang2022decreasing, ), this corresponds to a relative reduction in human annotation cost of 29.4%, 40.8%, and 19.8% at each respective scale. The largest gain was observed at $n = 50$, suggesting that EZ-Sort benefits from CLIP pre-ordering most effectively in mid-scale annotation settings. While a slight drop at $n = 100$ reflects increased CLIP uncertainty, our method still maintains substantial efficiency gains over both baselines. These sample sizes reflect real-world-like scenarios commonly found in medical or historical domains. Larger-scale generalization is discussed in Section 4.
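The percentages above can be re-derived from the comparison counts in Table 2. The sketch below assumes dataset sizes of 30, 50, and 100, which are the values of n for which n(n−1)/2 equals the exhaustive counts 435, 1,225, and 4,950; `summarize` is an illustrative helper name.

```python
def summarize(n, sort_cmp, ez_sort):
    """Return (exhaustive pairs, EZ-Sort share %, saving vs. sort comparison %)."""
    exhaustive = n * (n - 1) // 2                     # all unordered pairs
    share = 100.0 * ez_sort / exhaustive              # % of exhaustive comparisons
    saving = 100.0 * (sort_cmp - ez_sort) / sort_cmp  # % saved vs. prior work
    return exhaustive, round(share, 1), round(saving, 1)


# Human comparison counts from Table 2: n -> (sort comparison, EZ-Sort).
reported = {30: (126, 89), 50: (240, 142), 100: (582, 467)}

for n, (sort_cmp, ez_sort) in reported.items():
    print(n, summarize(n, sort_cmp, ez_sort))
# 30 (435, 20.5, 29.4)
# 50 (1225, 11.6, 40.8)
# 100 (4950, 9.4, 19.8)
```

The 90.5% reduction quoted in the abstract is the complement of the 9.4% share at n = 100, up to rounding.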
Comparison method allocation. Human annotation was requested for 23.1%, 18.4%, and 31.2% of comparisons at $n = 30$, $50$, and $100$, respectively; the rest were resolved automatically using Elo predictions. This demonstrates the adaptive allocation of annotation effort while preserving the MergeSort structure (cole1988parallel, ).
Implementation details. We used CLIP ViT-B/32 with a temperature-scaled softmax. Initial Elo ratings were linearly distributed in [1200, 1800] across buckets, with added noise. The adaptive-threshold and priority coefficients and the number of buckets (set separately for FGNET and for the other datasets) were selected via cross-validation, and prompts were generated with GPT-4. CLIP preprocessing requires 39 ms per image on a CPU on average, providing efficient automated pre-ordering prior to human annotation.
Human annotation cost. EZ-Sort keeps the theoretical $O(n \log n)$ bound: zero-shot pre-ordering is $O(n)$ (with a constant per-image cost), bucket-aware Elo initialization is $O(n)$, and the uncertainty-aware MergeSort follows the canonical $O(n \log n)$ schedule. Measured against the information-theoretic minimum ($\log_2 n!$), our method uses 0.87, 0.73, and 1.01 × that bound across the evaluated dataset sizes; at $n = 100$, we require 467 queries versus the 520-query lower limit (90% of optimal).
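The information-theoretic bound for comparison sorting can be computed as below. The sketch shows both the exact value of log2(n!) and the common Stirling-style approximation n·log2(n) − n·log2(e); the 520-query figure quoted for n = 100 appears consistent with the approximation. Function names are illustrative.

```python
import math


def log2_factorial(n):
    """Exact bound log2(n!), computed via the log-gamma function."""
    return math.lgamma(n + 1) / math.log(2.0)


def stirling_bound(n):
    """Stirling-style approximation of the same bound."""
    return n * math.log2(n) - n * math.log2(math.e)


for n in (30, 50, 100):
    print(f"n={n}: log2(n!)={log2_factorial(n):.1f}, "
          f"approx={stirling_bound(n):.1f}")

# Ratio of the 467 human queries reported at n = 100 to the approximate bound.
print(round(467 / stirling_bound(100), 2))  # 0.9
```

Human queries can fall below this bound because it applies to the total number of comparisons, and EZ-Sort resolves part of them automatically from Elo predictions.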
4. Discussion
EZ-Sort offers two primary advantages: a significant reduction in human annotation (up to 90.5%) and consistently strong inter-rater reliability across diverse domains. These benefits arise from its hybrid architecture, which combines CLIP-driven priors with uncertainty-aware comparison selection.
Our hierarchical prompting strategy outperforms flat prompting by 2.0 MAE on average, primarily due to CLIP’s strength in making binary decisions and facilitating progressive refinement. The KL-based uncertainty prioritizes ambiguous cases, effectively allocating annotation effort where model confidence is lowest.
EZ-Sort has several limitations. First, its performance depends on the reliability of the underlying vision-language model; domain-specific biases in CLIP could affect the initial ranking. Hierarchical prompting, while powerful, may struggle in domains with subtle or poorly defined visual distinctions. Finally, although our simulations suggest scalability to larger datasets, real-world performance in highly imbalanced or noisy settings remains to be validated.
We plan to integrate annotator reliability models (whitehill2009whose, ) that down-weight noisy labels, which can be crucial in crowd-sourced annotation. Validating the scalability of our approach on larger datasets also remains future work. Additionally, we plan to investigate few-shot fine-tuning (liu2022few, ) or prompt adaptation for VLMs to reduce uncertainty in less common domains. Preliminary trials with Bayesian Elo variants (e.g., TrueSkill (herbrich2006trueskill, )) did not yield improvements; thus, exploring new sorting algorithms tailored for uncertainty-guided annotation remains a potential avenue for future work.
Acknowledgments. This work was supported by National Research Foundation (RS-2024-00455720, RS-2024-00338048), National Institute of Health (2024ER040700, 2025ER040300), Hankuk University of Foreign Studies Research Fund of 2025, and IITP (RS-2020-II201373 Hanyang University, IITP-(2025)-RS-2023-00253914).
5. Gen-AI Usage Disclosure
No AI tools were used in algorithm development, data collection, analysis, hierarchical prompt design, or manuscript content generation. Claude assisted in code debugging and optimization, and GPT-4 provided grammar and stylistic edits for manuscript writing. All core technical and conceptual contributions are original.
References
- (1) Daniel J. Mankowitz, Andrea Michi, Anton Zhernov, Marco Gelmi, Marco Selvi, Cosmin Paduraru, Edouard Leurent, Shariq Iqbal, Jean-Baptiste Lespiau, Alex Ahern, et al. 2023. Faster sorting algorithms discovered using deep reinforcement learning. Nature 618, 7964 (2023), 257–263.
- (2) Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. 2022. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 9 (2022), 2337–2348.
- (3) Jing Li, Rafal Mantiuk, Junle Wang, Suiyi Ling, and Patrick Le Callet. 2018. Hybrid-MST: A hybrid active sampling strategy for pairwise preference aggregation. In Advances in Neural Information Processing Systems, Vol. 31.
- (4) Ikbeom Jang, Garrison Danley, Ken Chang, and Jayashree Kalpathy-Cramer. 2022. Decreasing annotation burden of pairwise comparisons with human-in-the-loop sorting: Application in medical image artifact rating. arXiv preprint arXiv:2202.04823 (2022).
- (5) Ralph Allan Bradley and Milton E. Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345.
- (6) Joseph L. Hansen. 1978. An Application of the Elo Rating System to Professional Baseball. Ph.D. Dissertation. Kalamazoo College.
- (7) Robin Swezey, Aditya Grover, Bruno Charron, and Stefano Ermon. 2021. Pirank: Scalable learning to rank via differentiable sorting. In Advances in Neural Information Processing Systems, Vol. 34, 21644–21654.
- (8) Xingjian Bai and Christian Coester. 2023. Sorting with predictions. In Advances in Neural Information Processing Systems, Vol. 36, 26563–26584.
