VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?

Minkyu Kim; Sangheon Lee; Dongmin Park

arXiv:2603.07888·cs.CV·March 10, 2026

TL;DR

VLM-SubtleBench is a new benchmark for evaluating vision-language models on subtle, real-world image differences across various domains, revealing significant gaps compared to human reasoning.

Contribution

The paper introduces VLM-SubtleBench, a comprehensive benchmark for subtle comparative reasoning in diverse domains, and provides extensive evaluations exposing current model limitations.

Findings

01

VLMs underperform humans on subtle differences

02

Performance varies significantly across difference types and domains

03

Identifies key areas for improving VLM reasoning capabilities

Abstract

The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce VLM-SubtleBench, a benchmark designed to evaluate VLMs on subtle comparative reasoning. Our benchmark covers ten difference types - Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action - and we curate paired question-image sets reflecting these fine-grained variations. Unlike prior benchmarks restricted to natural image datasets, our benchmark spans diverse domains, including industrial, aerial,…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- In this paper, they introduced an increasingly relevant capability: subtle visual comparison between images, across multiple domains. It contains various difference types (attribute, temporal, viewpoint, etc.) and datasets beyond natural images (industrial, medical, etc.).
- It also has a mix of real data and synthetic setups, to show controlled evaluation capabilities.
- The paper is easy to read, though the figures could use more text to make them clearer.

Weaknesses

- The dataset seems to be largely a small increment of prior multi-image VLM benchmarks like MLLM-CompBench, ReMI. The claim of novelty in subtlety is only partially convincing. Subtle differences are defined via embedding cosine similarity (DINOv3), but this does not necessarily guarantee perceptual or semantic subtlety.
- In Figure 3, it can be seen that when the catcher moved, the other player moved as well. As the paper claims to be fine-grained, I would be interested to know how the author
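The reviewer's concern about similarity-based selection is easier to weigh with the mechanism written out. Below is a minimal sketch, assuming each image has already been reduced to a plain float vector (a stand-in for DINOv3 features, which are not computed here). The function names and the 0.9 threshold are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


def is_subtle_pair(emb_a, emb_b, threshold=0.9):
    """Flag a pair as 'subtle' when its embeddings are nearly parallel.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy embeddings, made up for illustration only.
near = is_subtle_pair([1.0, 0.0, 0.2], [1.0, 0.05, 0.2])  # almost identical
far = is_subtle_pair([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])    # orthogonal
```

A high score here only says the two embeddings point in nearly the same direction. Whether that corresponds to a difference humans perceive as subtle is exactly what the reviewer questions.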

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper focuses on an interesting problem setup of fine-grained changes between two images, and it is interesting how current frontier models struggle at these tasks.
2. The paper adequately describes the dataset construction process, model evaluation setup, and experimental results. Overall, it is well written.
3. The evaluation process studies multiple factors that can influence model performance -- such as how to combine the two images when feeding images, different prompting strategies

Weaknesses

1. The task of detecting subtle changes between two images has been previously explored in works such as Spot-the-Diff [1], Img-Diff [2] and MLLM-CompBench [3], as noted by the authors. The primary novelty seems to be the expansion to multiple domains, more question types, and the combination of multiple-choice questions and captioning in a single benchmark. In this regard, novelty is a bit limited.
2. Further baselines/prompting strategies could be considered, such as:
   - Calculating regions of interest fro

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Substantive contribution via task definition & data curation. Clearly formalizing subtle comparative reasoning as an evaluation target and curating a benchmark dataset with transparent collection/validation protocols is, by itself, a meaningful research contribution.
- Breadth + diagnostics. Coverage of 10 difference types and 6 domains with controlled synthetic factors (e.g., brightness deltas, object size, translation, object count) supports failure-mode analysis rather than aggregate score

Weaknesses

- Data generation dependencies. Some Attribute pairs are created with Gemini-2.5 flash image preview (“nano-banana”) editing; Medical questions are refined by gpt-4o. This can introduce stylistic artifacts or distribution shifts that confound evaluation unless carefully audited. Please quantify any such effects (e.g., edited vs. non-edited subsets).
- The paper identifies notable gaps in temporal/spatial/viewpoint (e.g., stable accuracy requiring ~160 px camera translation in synthetic tests),
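The edited vs. non-edited comparison the reviewer asks for amounts to a per-subset accuracy breakdown. A minimal sketch follows, assuming each evaluation record is a dict with a boolean `correct` field and a grouping field such as `edited`. Both field names and the toy records are hypothetical; they are not the benchmark's actual schema or results.

```python
from collections import defaultdict


def accuracy_by_group(records, key):
    """Return {group value: accuracy} for records grouped by `key`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        group = record[key]
        totals[group] += 1
        hits[group] += int(record["correct"])
    return {group: hits[group] / totals[group] for group in totals}


# Toy records, made up for illustration; not results from the paper.
toy_records = [
    {"edited": True, "correct": True},
    {"edited": True, "correct": False},
    {"edited": False, "correct": True},
    {"edited": False, "correct": True},
]

by_subset = accuracy_by_group(toy_records, "edited")
# Positive gap means the model does better on non-edited pairs.
gap = by_subset[False] - by_subset[True]
```

The same helper could group by any other field, such as difference type or domain, to look for the per-category variation the findings mention. A large gap between subsets would suggest that editing artifacts, rather than the intended difference, are influencing the score.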

Code & Models

Datasets

KRAFTON/VLM-SubtleBench
dataset· 4.0k dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis