HueManity: Probing Fine-Grained Visual Perception in MLLMs
Rynaa Grover, Jayant Sravan Tamarapalli, Sahiti Yerramilli, Nilay Pande

TL;DR
HueManity introduces a new benchmark with Ishihara-style images to evaluate fine-grained visual perception in multimodal large language models, revealing significant perceptual weaknesses not captured by existing benchmarks.
Contribution
The paper presents HueManity, a scalable automated benchmark for assessing detailed visual perception in MLLMs using Ishihara-style images, exposing their perceptual limitations.
Findings
State-of-the-art MLLMs perform poorly on fine-grained pattern recognition tasks.
Humans and a fine-tuned ResNet-50 achieve near-ceiling accuracy on the benchmark.
The results reveal a critical perceptual weakness in MLLMs overlooked by traditional benchmarks.
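To make the accuracy comparisons in these findings concrete, here is a minimal scoring sketch. It assumes exact-match scoring after whitespace stripping, which this summary does not specify; the function names and the sample predictions are hypothetical and are not taken from the paper.

```python
def normalize(text: str) -> str:
    """Remove all whitespace; case is kept (an assumed normalization)."""
    return "".join(text.split())


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions equal to their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return hits / len(targets)


# Hypothetical outputs for three two-character targets (illustrative only).
targets = ["47", "A3", "Z9"]
predictions = ["47", " A3 ", "29"]
score = exact_match_accuracy(predictions, targets)
print(f"accuracy = {score:.3f}")
```

Under this assumed metric a string counts as correct only if every character matches, so a single confused character (such as "2" for "Z") costs the whole item.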
Abstract
Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety- and reliability-critical settings, perceptual acuity becomes essential. We present HueManity, a scalable automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieved only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans…
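As a rough illustration of what "Ishihara-style images embedding alphanumeric strings" can mean, the sketch below scatters colored dots over a canvas and colors each dot by whether its center falls on a character. It is not the authors' generation pipeline. The 3x5 bitmap glyphs, the color pair, the dot sizes, and every name in the code are assumptions made for this example, and the dots may overlap, whereas real Ishihara plates pack non-overlapping circles.

```python
import random

# Hypothetical 3x5 bitmap glyphs; a real generator would render a font.
GLYPHS = {
    "4": ["101", "101", "111", "001", "001"],
    "7": ["111", "001", "010", "010", "010"],
}
FG = (230, 120, 60)   # assumed foreground (character) dot color
BG = (120, 170, 90)   # assumed background dot color


def on_glyph(x, y, text, cell):
    """Return True if point (x, y) lies on a lit cell of the rendered text."""
    col, row = int(x // cell), int(y // cell)
    char_index, local_col = divmod(col, 4)  # 3 glyph columns + 1 spacing column
    if not (0 <= row < 5 and 0 <= char_index < len(text)) or local_col == 3:
        return False
    return GLYPHS[text[char_index]][row][local_col] == "1"


def make_dots(text, cell=20, n_dots=600, seed=0):
    """Sample random dots as (x, y, radius, color); color encodes the text."""
    rng = random.Random(seed)
    width, height = cell * (4 * len(text) - 1), cell * 5
    dots = []
    for _ in range(n_dots):
        r = rng.uniform(2.0, 5.0)
        x = rng.uniform(r, width - r)   # keep the whole dot inside the canvas
        y = rng.uniform(r, height - r)
        color = FG if on_glyph(x, y, text, cell) else BG
        dots.append((x, y, r, color))
    return width, height, dots


width, height, dots = make_dots("47")
n_fg = sum(1 for d in dots if d[3] == FG)
print(f"{width}x{height} canvas, {len(dots)} dots, {n_fg} on the characters")
```

The returned dot list can be drawn as filled circles with any raster library. Because the characters exist only as a color difference between dots, reading them requires grouping dots by hue rather than following edges, which is the kind of pattern recognition the abstract describes.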
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
This paper reveals an intriguing blind spot of current MLLMs and highlights a new axis for robustness research; the observation may catalyze broader studies on rare or specially structured imagery.
This paper does not investigate whether lightweight fine-tuning (rather than mere in-context learning) can already lift MLLM accuracy to near-human levels. If the deficit can be erased with a few gradient steps, the issue—and the accompanying dataset—may merit only limited attention.
1. The paper is well written and clearly structured, making it easy to follow.
2. It evaluates several state-of-the-art MLLMs, including GPT-4.1, Claude 3.7 Sonnet, Qwen-VL Max, LLaVA-v1.6, and Pixtral, across two tasks: the Number Recognition Task and the Alphanumeric Recognition Task.
3. The work provides a comparative analysis with existing MLLM benchmarks. However, some key benchmarks (e.g., MMVP [1], MERLIM [2] and MME [3]) are missing from the evaluation.
[1] Eyes Wide Shut? Exploring th
1. The paper mainly reports a failure case of existing models but offers no new theoretical insights. Prior work such as Eyes Wide Shut [1] and MERLIM [2] has already shown that the visual backbones of MLLMs fail to capture fine-grained visual details.
2. HueManity measures only color-based figure–ground discrimination under a single visual structure (Ishihara-style dots). While the idea is well motivated, it represents only a narrow and somewhat artificial subset of visual examples for evaluati
* This paper identifies a critical deficiency in modern MLLMs: their surprisingly weak performance in fine-grained visual perception, despite strong performance on higher-level vision-language tasks.
* The proposed benchmark, HueManity, is well-designed and presents a valuable resource for the community. It can be widely used in future work to evaluate and diagnose the fine-grained visual understanding capabilities of MLLMs.
* The authors conduct comprehensive experiments demonstrating that ev
* This work focuses on a single aspect of visual understanding—recognizing characters in color-patterned images—which is relatively narrow compared to existing MLLM benchmarks. Modern benchmarks typically evaluate multiple capabilities, including low-level perception, high-level reasoning, OCR, and knowledge integration. While this task presents a challenging variant of OCR, the scope of the benchmark is limited in covering the broader spectrum of multimodal understanding expected from MLLMs. *
The paper is well-written and easy to follow. The benchmark is novel and does reflect a striking limitation in MLLMs' visual perception, which pushes against the misconception that MLLMs can outperform humans in all simple visual tasks. The paper also considers several MLLMs, both commercial and open-source, which increases its value as a benchmark for future MLLM development.
1. The paper misses several related papers that study the ability of MLLMs to perceive visual details [1, 2, 3, 4], and thus does not properly place its findings in the context of other existing evidence to clarify novelty and relevance.
2. Text recognition datasets (e.g., TextVQA) measure the same capability that this paper tries to measure: how well can MLLMs read text in various visual settings. Given that TextVQA contains extensive variations of text and background in real world settings, i
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Subtitles and Audiovisual Media
