HueManity: Probing Fine-Grained Visual Perception in MLLMs

Rynaa Grover; Jayant Sravan Tamarapalli; Sahiti Yerramilli; Nilay Pande

arXiv:2506.03194 · cs.CV · February 3, 2026

TL;DR

HueManity introduces a new benchmark with Ishihara-style images to evaluate fine-grained visual perception in multimodal large language models, revealing significant perceptual weaknesses not captured by existing benchmarks.

Contribution

The paper presents HueManity, a scalable automated benchmark for assessing detailed visual perception in MLLMs using Ishihara-style images, exposing their perceptual limitations.

Findings

1. State-of-the-art MLLMs perform poorly on fine-grained pattern recognition tasks.
2. Humans and a fine-tuned ResNet-50 achieve near-ceiling accuracy on the benchmark.
3. The results reveal a critical perceptual weakness in MLLMs that traditional benchmarks overlook.
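Accuracy comparisons like the ones in these findings are commonly computed by exact string match between a model's answer and the embedded string. The sketch below is illustrative only: the function names `normalize` and `exact_match_accuracy` are hypothetical, and the whitespace-stripping rule is an assumption, since this page does not state the scoring rule the paper uses.

```python
def normalize(answer: str) -> str:
    """Remove all whitespace; case is preserved (an assumption, not the paper's rule)."""
    return "".join(answer.split())


def exact_match_accuracy(predictions, targets):
    """Fraction of predictions that equal their target after normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    correct = sum(normalize(p) == normalize(t) for p, t in zip(predictions, targets))
    return correct / len(targets)


# Made-up answers for illustration; these are not results from the paper.
example_predictions = ["74", "7 4", "71", ""]
example_targets = ["74", "74", "74", "74"]
example_accuracy = exact_match_accuracy(example_predictions, example_targets)
```

With these made-up inputs, two of the four answers match, so `example_accuracy` is 0.5. Under exact match a nearly correct reading such as "71" for "74" scores zero, which is one reason longer alphanumeric strings would be expected to score lower than short numeric ones.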

Abstract

Recent Multimodal Large Language Models (MLLMs) demonstrate strong high-level visual reasoning on tasks such as visual question answering and image captioning. Yet existing benchmarks largely overlook their ability to capture fine-grained perceptual details. As MLLMs are increasingly deployed in safety- and reliability-critical settings, perceptual acuity becomes essential. We present HueManity, a scalable automated benchmark for assessing fine-grained visual perception in MLLMs. HueManity comprises 83,850 Ishihara-style images embedding alphanumeric strings, designed to evaluate pattern recognition, a core aspect of visual understanding. Our evaluation of nine state-of-the-art MLLMs uncovers a striking performance deficit: the strongest model achieved only 33.6% accuracy on a simple numeric task and 3% on a harder alphanumeric task, compared to near-ceiling performance from humans…
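To make "Ishihara-style images embedding alphanumeric strings" concrete, here is a minimal sketch of how such a plate might be generated automatically. It is not the authors' pipeline: the 3x5 bitmap glyphs, the color values, the dot counts, and the names `GLYPHS`, `in_glyph`, and `make_plate` are all assumptions made for illustration. It also allows dots to overlap, whereas real Ishihara plates use non-overlapping circles.

```python
import random

# Hypothetical 3x5 bitmap glyphs, for illustration only (not the paper's font).
GLYPHS = {
    "7": ["111", "001", "010", "010", "010"],
    "A": ["010", "101", "111", "101", "101"],
}


def in_glyph(text, x, y, width, height):
    """Return True if point (x, y) lands on an 'on' cell of the rendered text."""
    cols = 4 * len(text) - 1  # 3 columns per glyph plus 1-column gaps between glyphs
    col = min(int(x / width * cols), cols - 1)
    row = min(int(y / height * 5), 4)
    glyph_index, offset = divmod(col, 4)
    if offset == 3:  # gap column between two glyphs
        return False
    return GLYPHS[text[glyph_index]][row][offset] == "1"


def make_plate(text, width=200, height=100, n_dots=1500, seed=0,
               fg=(230, 120, 40), bg=(120, 170, 60)):
    """Scatter random dots; color each by whether its center lies on the text.

    Returns a list of (x, y, radius, (r, g, b), is_foreground) tuples.
    """
    rng = random.Random(seed)  # seeded so that plates are reproducible
    dots = []
    for _ in range(n_dots):
        x, y = rng.uniform(0, width), rng.uniform(0, height)
        radius = rng.uniform(1.5, 4.0)
        is_foreground = in_glyph(text, x, y, width, height)
        base = fg if is_foreground else bg
        # Jitter each channel so neither region is a single flat color.
        color = tuple(max(0, min(255, c + rng.randint(-15, 15))) for c in base)
        dots.append((x, y, radius, color, is_foreground))
    return dots


plate = make_plate("7A")
```

Because the string is recoverable only from the color difference between dot populations, and the label is known at generation time, a generator of this kind can produce labeled examples at scale without manual annotation. The resulting dot list can be rasterized with any drawing library.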

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

This paper reveals an intriguing blind spot of current MLLMs and highlights a new axis for robustness research; the observation may catalyze broader studies on rare or specially structured imagery.

Weaknesses

This paper does not investigate whether lightweight fine-tuning (rather than mere in-context learning) can already lift MLLM accuracy to near-human levels. If the deficit can be erased with a few gradient steps, the issue—and the accompanying dataset—may merit only limited attention.

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper is well written and clearly structured, making it easy to follow.
2. It evaluates several state-of-the-art MLLMs, including GPT-4.1, Claude 3.7 Sonnet, Qwen-VL Max, LLaVA-v1.6, and Pixtral, across two tasks: the Number Recognition Task and the Alphanumeric Recognition Task.
3. The work provides a comparative analysis with existing MLLM benchmarks. However, some key benchmarks (e.g., MMVP [1], MERLIM [2], and MME [3]) are missing from the evaluation.

[1] Eyes Wide Shut? Exploring th

Weaknesses

1. The paper mainly reports a failure case of existing models but offers no new theoretical insights. Prior work such as Eyes Wide Shut [1] and MERLIM [2] has already shown that the visual backbones of MLLMs fail to capture fine-grained visual details.
2. HueManity measures only color-based figure–ground discrimination under a single visual structure (Ishihara-style dots). While the idea is well motivated, it represents only a narrow and somewhat artificial subset of visual examples for evaluati

Reviewer 03 · Rating 4 · Confidence 2

Strengths

* This paper identifies a critical deficiency in modern MLLMs: their surprisingly weak performance in fine-grained visual perception, despite strong performance on higher-level vision-language tasks.
* The proposed benchmark, HueManity, is well-designed and presents a valuable resource for the community. It can be widely used in future work to evaluate and diagnose the fine-grained visual understanding capabilities of MLLMs.
* The authors conduct comprehensive experiments demonstrating that ev

Weaknesses

* This work focuses on a single aspect of visual understanding, recognizing characters in color-patterned images, which is relatively narrow compared to existing MLLM benchmarks. Modern benchmarks typically evaluate multiple capabilities, including low-level perception, high-level reasoning, OCR, and knowledge integration. While this task presents a challenging variant of OCR, the benchmark covers only a small part of the broader spectrum of multimodal understanding expected from MLLMs.

Reviewer 04 · Rating 4 · Confidence 4

Strengths

The paper is well-written and easy to follow. The benchmark is novel and reflects a striking limitation in MLLMs’ visual perception, which pushes against the misconception that MLLMs can outperform humans on all simple visual tasks. The paper also considers several MLLMs, both commercial and open-source, which increases its value as a benchmark for future MLLM development.

Weaknesses

1. The paper misses several related papers that study the ability of MLLMs to perceive visual details [1, 2, 3, 4], and thus does not properly place its findings in the context of existing evidence to clarify novelty and relevance.
2. Text recognition datasets (e.g., TextVQA) measure the same capability that this paper tries to measure: how well MLLMs can read text in various visual settings. Given that TextVQA contains extensive variations of text and background in real world settings, i

Code & Models

Datasets

Jayant-Sravan/HueManity
dataset · 46 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Subtitles and Audiovisual Media