VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
Minkyu Kim, Sangheon Lee, Dongmin Park

TL;DR
VLM-SubtleBench is a new benchmark for evaluating vision-language models on subtle, real-world image differences across various domains, revealing significant gaps compared to human reasoning.
Contribution
The paper introduces VLM-SubtleBench, a comprehensive benchmark for subtle comparative reasoning in diverse domains, and provides extensive evaluations exposing current model limitations.
Findings
VLMs underperform humans on subtle differences
Performance varies significantly across difference types and domains
Identifies key areas for improving VLM reasoning capabilities
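The finding that performance varies across difference types and domains implies a per-group accuracy breakdown. A minimal sketch of such a breakdown follows, assuming multiple-choice records with hypothetical field names (`difference_type`, `domain`, `prediction`, `answer`); it is not the authors' evaluation code, and the toy records are invented for illustration rather than taken from the benchmark.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group: accuracy} for multiple-choice records grouped by `key`."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        # A prediction counts as correct only on an exact match with the answer.
        correct[group] += int(record["prediction"] == record["answer"])
    return {group: correct[group] / total[group] for group in total}


# Toy records, invented for illustration only (NOT benchmark results).
toy_records = [
    {"difference_type": "Attribute", "domain": "natural", "prediction": "A", "answer": "A"},
    {"difference_type": "Attribute", "domain": "industrial", "prediction": "B", "answer": "A"},
    {"difference_type": "Viewpoint", "domain": "natural", "prediction": "B", "answer": "B"},
    {"difference_type": "Viewpoint", "domain": "aerial", "prediction": "A", "answer": "B"},
    {"difference_type": "Viewpoint", "domain": "aerial", "prediction": "B", "answer": "B"},
]

by_type = accuracy_by_group(toy_records, "difference_type")
by_domain = accuracy_by_group(toy_records, "domain")
```

Grouping by one key at a time keeps the helper reusable for both the difference-type and the domain breakdown.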
Abstract
The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial,…
Peer Reviews
Decision · ICLR 2026 Poster
- The paper addresses an increasingly relevant capability: subtle visual comparison between images, across multiple domains. The benchmark contains various difference types (attribute, temporal, viewpoint, etc.) and datasets beyond natural images (industrial, medical, etc.).
- It also mixes real data and synthetic setups, which enables controlled evaluation.
- The paper is easy to read, though the figures could use more text to make them clearer.
- The dataset seems to be largely a small increment over prior multi-image VLM benchmarks such as MLLM-CompBench and ReMI. The claim of novelty in subtlety is only partially convincing. Subtle differences are defined via embedding cosine similarity (DINOv3), but this does not necessarily guarantee perceptual or semantic subtlety.
- In Figure 3, it can be seen that when the catcher moved, the other player moved as well. As the paper claims to be fine-grained, I would be interested to know how the author
1. The paper focuses on an interesting problem setup of fine-grained changes between two images, and it is interesting how current frontier models struggle at these tasks.
2. The paper adequately describes the dataset construction process, model evaluation setup, and experimental results. Overall, it is well written.
3. The evaluation process studies multiple factors that can influence model performance -- such as how to combine the two images when feeding images, different prompting strategies
1. The task of detecting subtle differences between two images has previously been explored in works such as Spot-the-Diff [1], Img-Diff [2], and MLLM-CompBench [3], as the authors note. The primary novelty seems to be the expansion to multiple domains, more question types, and the combination of multiple-choice questions and captioning in a single benchmark. In this regard, the novelty is somewhat limited.
2. Further baselines/prompting strategies could be considered, such as:
   - Calculating regions of interest fro
- Substantive contribution via task definition & data curation. Clearly formalizing subtle comparative reasoning as an evaluation target and curating a benchmark dataset with transparent collection/validation protocols is, by itself, a meaningful research contribution.
- Breadth + diagnostics. Coverage of 10 difference types and 6 domains with controlled synthetic factors (e.g., brightness deltas, object size, translation, object count) supports failure-mode analysis rather than aggregate score
- Data generation dependencies. Some Attribute pairs are created with Gemini-2.5 flash image preview (“nano-banana”) editing; Medical questions are refined by gpt-4o. This can introduce stylistic artifacts or distribution shifts that confound evaluation unless carefully audited. Please quantify any such effects (e.g., edited vs. non-edited subsets).
- The paper identifies notable gaps in temporal/spatial/viewpoint (e.g., stable accuracy requiring ~160 px camera translation in synthetic tests),
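The reviews above note that subtlety is defined via embedding cosine similarity (DINOv3). A minimal sketch of that kind of check follows, using plain Python lists as stand-ins for image embeddings. The function names, the toy vectors, and the `threshold=0.9` value are all assumptions made for illustration; the paper's actual threshold and embedding pipeline are not given here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def is_subtle_pair(emb_a, emb_b, threshold=0.9):
    """Treat a pair as 'subtle' when its embeddings are highly similar.

    `threshold` is a hypothetical value, not one reported by the paper.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings (stand-ins for real image features).
near_identical = is_subtle_pair([1.0, 0.0, 0.0], [0.9, 0.1, 0.0])
clearly_different = is_subtle_pair([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

As the reviewer points out, a high embedding similarity does not by itself guarantee that a difference is perceptually or semantically subtle, so a filter like this would at most be a first-pass heuristic.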
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
