TL;DR
PairBench provides a comprehensive framework to evaluate vision-language models' ability to compare images reliably, highlighting their strengths, weaknesses, and correlation with existing benchmarks.
Contribution
Introduces PairBench, a systematic and customizable evaluation framework for assessing VLMs' image comparison capabilities across multiple metrics.
Findings
No model excels across all metrics
VLMs often fail to maintain symmetric similarity scores
Performance correlates with existing benchmarks
Abstract
Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fundamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses.…
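To make the "consistency across pair ordering" metric concrete, here is a minimal Python sketch of one way to quantify asymmetry in a similarity scorer. It is not PairBench's actual metric definition, which this excerpt does not give. The function name `symmetry_gap` and the two toy scorers are hypothetical stand-ins; a real evaluation would replace them with calls that prompt a VLM on an image pair in both orders.

```python
from statistics import mean


def symmetry_gap(score, pairs):
    """Mean absolute difference between score(a, b) and score(b, a).

    A perfectly order-consistent scorer yields 0; larger values mean
    the score depends more strongly on the order of the two inputs.
    This is an illustrative measure, not the paper's definition.
    """
    return mean(abs(score(a, b) - score(b, a)) for a, b in pairs)


# Hypothetical stand-ins for a VLM scorer; integers stand in for images.
def toy_symmetric(a, b):
    # Depends only on the unordered pair, so it is symmetric.
    return 10 - abs(a - b)


def toy_order_biased(a, b):
    # Penalizes pairs whose first element is larger, so order matters.
    return 10 - abs(a - b) - (1 if a > b else 0)


pairs = [(1, 2), (5, 3), (4, 4)]
sym_gap = symmetry_gap(toy_symmetric, pairs)
biased_gap = symmetry_gap(toy_order_biased, pairs)
print(f"symmetric scorer gap: {sym_gap}")
print(f"order-biased scorer gap: {biased_gap:.3f}")
```

A gap near zero suggests a single call per pair may suffice, whereas a large gap means a practitioner would need to score both orders and aggregate, which doubles the number of model calls.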
Peer Reviews
Decision·Submitted to ICLR 2026
- The four metrics considered are each well-motivated for the task of evaluating models as similarity kernels.
- The authors' evaluation of a broad range of models with different sizes, different strengths and different access (open/proprietary) allows them to describe interesting trends in task performance. Each of the evaluations is made more robust through the inclusion of multiple prompts.
- The insights regarding the lack of similarity are important, as they directly affect the cost-effect
- My main criticism, and the reason for the soundness score and the overall rating, is the nature of the transformations. The only images that can be given a high ground-truth similarity score (10 or 6, from the fixed set of scores) are identical or transformed images. In real applications, however, practitioners are likely to be computing the similarity of distinct images (including those found in Figure 16 that are not identical but could be considered similar due to them both containing birds
- The paper tackles one issue: as VLMs are increasingly used as judges, how trustworthy are they as automated evaluators or rankers?
- The benchmark uses existing datasets with controlled transformations. The authors conducted a thorough evaluation of multiple VLMs. Evaluating with different prompts was also a good choice.
- The paper is easy to read.
- A benchmark on pairwise images, and VLMs failing at it, is hardly new. [1][2]
- The method merely uses basic data augmentations to create the benchmark. These do not represent how users employ VLMs in real life to compare or judge between samples. The pairings are mostly synthetic or self-generated. The task seems to be more of a visual robustness test.
- A proper ablation study of why models fail, whether it is a vision encoder problem or an alignment problem, is needed. Ch
Methodological Rigor in Dataset and Metric Design: The dataset is controllable and human-validated, and is derived from three public benchmarks (ImageNet, MS-COCO, WhatsUp) to ensure reproducibility; pairs are categorized into Identical/Transformed/Irrelevant to isolate specific visual variations. Ground truth scores are validated via 70+ annotators, avoiding subjective biases in "gold standard" labels. Insights of the correlation between VLMs’ visual comparison ability and their performance on
Disconnect between the bench’s Pair design and practical visual comparison demands: The current framework defines three Pair types (Identical/Transformed/Irrelevant) based on trivial visual manipulations (e.g., color jitter, basic rotation, Gaussian blur) or simple content overlap (e.g., near-duplicate images, random irrelevant content). However, these designs fail to capture the core real-world capabilities users actually care about: specifically, a model’s ability to (1) recognize the same obj
