VRIQ: Benchmarking and Analyzing Visual-Reasoning IQ of VLMs
Tina Khezresmaeilzadeh, Jike Zhong, Konstantinos Psounis

TL;DR
VRIQ is a new benchmark that evaluates the visual reasoning IQ of VLMs, revealing significant perception-related weaknesses and limited improvements from tool augmentation, highlighting the need for better perception and reasoning integration.
Contribution
Introduces VRIQ, a comprehensive benchmark with diagnostic probes to analyze visual reasoning abilities and perception issues in VLMs, providing insights into their limitations.
Findings
VLMs perform poorly on abstract puzzles with ~28% accuracy.
Natural image reasoning yields ~45% accuracy, still weak.
Perception errors alone account for about 56% of failures, combined perception-and-reasoning errors for 43%, and reasoning errors alone for only 1%.
Abstract
Recent progress in Vision Language Models (VLMs) has raised the question of whether they can reliably perform nonverbal reasoning. To this end, we introduce VRIQ (Visual Reasoning IQ), a novel benchmark designed to assess and analyze the visual reasoning ability of VLMs. We evaluate models on two sets of tasks: abstract puzzle-style and natural-image reasoning tasks. We find that on abstract puzzles, performance remains near random with an average accuracy of around 28%, while natural tasks yield better but still weak results with 45% accuracy. We also find that tool-augmented reasoning demonstrates only modest improvements. To uncover the source of this weakness, we introduce diagnostic probes targeting perception and reasoning. Our analysis demonstrates that around 56% of failures arise from perception alone, 43% from both perception and reasoning, and only a mere 1% from reasoning…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. As VLMs become more capable, it is crucial to discern whether they possess genuine reasoning abilities or merely rely on superficial statistical features from training data (shortcut learning). This paper directly addresses this core issue. 2. The conclusion that "the bottleneck is perception, not reasoning," while based on this specific dataset, is highly insightful. It points to a clear direction for future VLM improvements (i.e., strengthening foundational visual perception, especially sp
1. The entire benchmark (VRIQ) contains only N=440 samples. In the fine-grained breakdown across 5 categories and 2 domains, some sub-categories have an extremely small number of samples (e.g., Table 1 shows that each of the 5 "natural" domain categories has only 20 questions). 2. Due to the small sample size, the statistical reliability of the results is low. On a test set with only 20 questions, getting one or two more questions right or wrong causes a large fluctuation in accuracy (5%-10%).
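The reviewer's point about 20-question categories can be made concrete with a confidence interval. The following is a minimal sketch, not part of the paper: it computes a Wilson score interval for an accuracy estimate, and the counts used (9 of 20, 198 of 440) are illustrative values chosen to match the roughly 45% natural-image accuracy, not reported per-category results.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def accuracy_step(total: int) -> float:
    """Change in accuracy caused by one more (or one fewer) correct answer."""
    return 1 / total


if __name__ == "__main__":
    # Illustrative counts only (assumption): about 45% accuracy at two sample sizes.
    for correct, total in [(9, 20), (198, 440)]:
        low, high = wilson_interval(correct, total)
        print(f"n={total}: acc={correct / total:.2f}, "
              f"95% CI=({low:.2f}, {high:.2f}), step={accuracy_step(total):.3f}")
```

With 20 questions each answer moves accuracy by 5 points and the interval spans several tens of points, whereas at 440 questions the interval is under 10 points wide. That is why per-category comparisons are much less reliable than the aggregate numbers.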
The probe study is valuable. The perception-vs-reasoning decomposition and finer probe tags (count, position, rotation, 3D, etc.) are a useful analysis lens beyond aggregate accuracy.
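The perception-vs-reasoning decomposition mentioned above can be sketched as a simple bucketing of failed items. This is a hypothetical illustration only: the `ProbeResult` record, its field names, and the "unattributed" bucket are assumptions made for the example, and the paper's actual probe protocol is not described in this page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    """Hypothetical per-item record; field names are assumptions."""
    item_id: str
    answered_correctly: bool  # outcome on the full task
    perception_ok: bool       # outcome on the perception probe
    reasoning_ok: bool        # outcome on the reasoning probe


def attribute_failure(result: ProbeResult) -> str:
    """Assign a failed item to one failure category based on its probes."""
    if not result.perception_ok and not result.reasoning_ok:
        return "both"
    if not result.perception_ok:
        return "perception_only"
    if not result.reasoning_ok:
        return "reasoning_only"
    return "unattributed"  # failed the task although both probes passed


def failure_breakdown(results: list[ProbeResult]) -> dict[str, float]:
    """Fraction of failed items that falls into each category."""
    failures = [r for r in results if not r.answered_correctly]
    if not failures:
        return {}
    counts = Counter(attribute_failure(r) for r in failures)
    return {category: n / len(failures) for category, n in counts.items()}
```

Items answered correctly are excluded before the fractions are computed, so the shares describe failures only, matching how the 56% / 43% / 1% split is phrased in the abstract.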
- Small sample size. The core benchmark (440 items) limits statistical power. In addition, category counts are unbalanced between abstract and natural samples. - The images are a mix of open-repository and model-generated images, but sourcing and licenses are not described in the paper. How the images were generated using models is also not described. - The process for creating and validating data samples is not documented. - Claims about “thinking with images” and tool use lack a transparent h
1. The paper is clearly written and organized. 2. The proposed three-tier evaluation framework decouples perception and reasoning as an integrated capability. It precisely attributes model failures to three categories: perception errors only, reasoning errors only and errors in both. 3. It simultaneously evaluates open-source and proprietary models (such as Qwen, GPT-4o, etc.), making it quite comprehensive. The result analysis is in-depth, with an appropriate combination of quantitative and qua
1. Although the authors claim that VRIQ is the first to conduct a parallel comparison in the abstract-natural dual domain, this setup is similar to MLLM IQ benchmarks such as MMIQ and MARVEL. It is hard to clearly demonstrate the unique advantages of VRIQ in terms of diagnostic accuracy or research inspiration. 2. The VRIQ benchmark contains a total of 440 questions, and the sample size of some reasoning categories is quite small. With such small samples, evaluation results may depend heavily on chance. 3. F
1. Clarity: The paper has a logical flow—starting with the motivation of evaluating nonverbal reasoning in VLMs, followed by benchmark design, evaluation framework, experiments, and analysis. 2. Originality: The paper fills a gap in existing visual reasoning benchmarks by constructing parallel abstract-natural task families with identical logical structures, enabling controlled comparison of VLMs’ performance across symbolic and semantics-grounded reasoning. The hierarchical diagnostic probe fr
- Limited benchmark scale: With only 440 total questions, the sample size is small for robust statistical analysis, especially when evaluating fine-grained perception categories. This may lead to unstable accuracy estimates, particularly for tasks with low model performance near random guess. - Inadequate model coverage: The evaluated models lack diversity in two aspects: (1) No large-scale open-source VLMs (e.g., InternVL3-26B, Qwen2.5-VL-14B) are included, making it hard to assess the impact o
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Language, Metaphor, and Cognition
