TL;DR
This paper introduces OddGridBench, a benchmark for evaluating fine-grained visual discrepancy detection in multimodal large language models, revealing current models' limitations and proposing a reinforcement learning framework to improve their perceptual sensitivity.
Contribution
The work presents a new benchmark dataset and a reinforcement learning method to enhance the visual discrepancy sensitivity of multimodal large language models.
Findings
MLLMs perform far below human levels in visual discrepancy detection.
OddGrid-GRPO significantly improves models' fine-grained visual discrimination.
Code and dataset are publicly available at the provided URL.
Abstract
Multimodal large language models (MLLMs) have achieved remarkable performance across a wide range of vision-language tasks. However, their ability in low-level visual perception, particularly in detecting fine-grained visual discrepancies, remains underexplored and lacks systematic analysis. In this work, we introduce OddGridBench, a controllable benchmark for evaluating the visual discrepancy sensitivity of MLLMs. OddGridBench comprises over 1,400 grid-based images in which a single element differs from all others in one or more visual attributes, such as color, size, rotation, or position. Experiments reveal that all evaluated MLLMs, including open-source families such as Qwen3-VL and InternVL3.5 and proprietary systems such as Gemini-2.5-Pro and GPT-5, perform far below human levels in visual discrepancy detection. We further propose OddGrid-GRPO, a reinforcement learning framework…
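To make the task format concrete, here is a minimal sketch of how one odd-one-out grid item of the kind the abstract describes might be specified and scored. It is an illustration only, not the authors' generator: the `Cell` fields, the perturbation magnitudes in `DELTAS`, and the helper names `make_odd_grid` and `is_correct` are all hypothetical, and the sketch builds a symbolic grid description rather than rendering an image.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Cell:
    """Symbolic description of one grid element (hypothetical schema)."""
    hue: float       # degrees in [0, 360)
    scale: float     # relative size, 1.0 = baseline
    rotation: float  # degrees
    offset: tuple    # (dx, dy) shift inside the cell, in pixels


# Assumed perturbation magnitudes, chosen for illustration (not from the paper).
DELTAS = {
    "color": lambda c: replace(c, hue=(c.hue + 20.0) % 360.0),
    "size": lambda c: replace(c, scale=c.scale * 1.15),
    "rotation": lambda c: replace(c, rotation=c.rotation + 15.0),
    "position": lambda c: replace(c, offset=(c.offset[0] + 4, c.offset[1])),
}


def make_odd_grid(rows, cols, attributes, seed=0):
    """Return (grid, answer): every cell is identical except one odd cell
    that differs in each attribute listed in `attributes`."""
    rng = random.Random(seed)
    base = Cell(hue=200.0, scale=1.0, rotation=0.0, offset=(0, 0))
    answer = (rng.randrange(rows), rng.randrange(cols))

    # Apply each requested perturbation to the odd cell only.
    odd = base
    for name in attributes:
        odd = DELTAS[name](odd)

    grid = [
        [odd if (r, c) == answer else base for c in range(cols)]
        for r in range(rows)
    ]
    return grid, answer


def is_correct(predicted, answer):
    """Exact-match scoring on the (row, col) location of the odd cell."""
    return tuple(predicted) == tuple(answer)
```

In this sketch, shrinking the values in `DELTAS` would make the odd element harder to spot, and passing several attribute names at once corresponds to the multi-attribute case mentioned in the abstract.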
