VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning
Hao Yan, Xingchen Liu, Hao Wang, Zhenbiao Cao, Handong Zheng, Liang Yin, Xinxing Su, Zihao Chen, Jihao Wu, Minghui Liao, Chao Weng, Wei Chen, Yuliang Liu, Xiang Bai

TL;DR
This paper introduces VisuRiddles, a new benchmark for abstract visual reasoning, and a data synthesis framework that enhances multimodal large language models' perception and reasoning abilities in abstract graphics.
Contribution
The paper presents a novel benchmark and a perceptual riddle synthesizer to improve MLLMs' perception and interpretability in abstract visual reasoning tasks.
Findings
Fine-grained perception is the main bottleneck in abstract visual reasoning (AVR).
Synthesized training data improves MLLMs' performance.
Models show better reasoning with perceptual supervision.
Abstract
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks. However, Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics. To tackle this issue, we investigate the bottlenecks in current MLLMs and synthesize training data to improve their abstract visual perception. First, we propose VisuRiddles, a benchmark for AVR, featuring tasks meticulously constructed to assess models' reasoning capacities across five core dimensions and two high-level reasoning categories. Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions. PRS not only generates valuable training data for abstract graphics but also provides fine-grained perceptual description, crucially…
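As a rough illustration of the idea behind a riddle synthesizer that emits perceptual descriptions alongside each riddle, the sketch below builds a toy counting riddle from a symbolic state and derives the description from that same state. It is not the authors' PRS: the names `Riddle`, `SHAPES`, and `synthesize_count_riddle`, the arithmetic-progression rule, and the distractor scheme are all assumptions made for this example, and no image is rendered.

```python
import random
from dataclasses import dataclass

# Hypothetical shape vocabulary; not taken from the paper.
SHAPES = ["circle", "triangle", "square"]


@dataclass
class Riddle:
    panels: list        # per-panel symbolic content: (shape, count)
    options: list       # candidate counts for the missing panel
    answer_index: int   # index of the correct option
    description: str    # fine-grained perceptual description of the panels


def synthesize_count_riddle(rng, num_panels=3, num_options=4):
    """Build one toy riddle whose hidden rule is an arithmetic progression in object count."""
    shape = rng.choice(SHAPES)
    start = rng.randint(1, 3)
    step = rng.randint(1, 2)
    counts = [start + i * step for i in range(num_panels + 1)]
    shown, target = counts[:-1], counts[-1]

    # Distractors are nearby counts that differ from the correct one.
    candidates = [c for c in range(1, target + 4) if c != target]
    options = rng.sample(candidates, num_options - 1) + [target]
    rng.shuffle(options)

    # The description comes from the same symbolic state that would drive
    # rendering, so it matches the panels by construction.
    description = "; ".join(
        f"panel {i + 1} contains {n} {shape}{'s' if n != 1 else ''}"
        for i, n in enumerate(shown)
    )
    return Riddle(
        panels=[(shape, n) for n in shown],
        options=options,
        answer_index=options.index(target),
        description=description,
    )


riddle = synthesize_count_riddle(random.Random(0))
print(riddle.description)
print(riddle.options, riddle.answer_index)
```

Because the riddle and its description share one source of truth, each synthesized sample can in principle supply both the reasoning target and the perceptual supervision signal, which is the property the abstract attributes to synthesized data.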
Peer Reviews
Decision: ICLR 2026 Poster
I believe the authors have presented a highly complete project. This manuscript offers clear descriptions of the motivations, the method designs, and the overall contributions.
My only (and perhaps relatively trivial) concern is the generalizability of the new PAVR baseline method. Although PAVR sets a high bar on the authors' own VisuRiddles benchmark, how does it perform on existing AVR benchmarks? So far, the authors have only shown its performance on VisuLogic in Appendix D. I believe PAVR needs to be tested on further benchmarks to check its consistency, such as LogicVista and/or VOILA for rigorous logic reasoning, and MathVe
1) The paper presents rigorous empirical analysis, including comprehensive comparisons with both open-source and commercial MLLMs. Ablation studies are detailed and support the key claims (e.g., perception annotations yield +42.7% improvement, RL adds +7.3% further gains). 2) The findings have broad implications for MLLM development, suggesting that improving perceptual resolution and grounding may be more impactful than simply enlarging models or adding reasoning traces.
1) While the synthesizer is innovative, the paper admits that generated riddles are “deliberately easier” and may lack the richness and noise of real-world abstract reasoning tasks. This could limit transferability. 2) Although VisuRiddles is comprehensive, stronger evidence of generalization would come from evaluating PAVR on unseen external datasets (e.g., MARVEL, PuzzleVQA, or VisLogic). 3) The “rethinking” phenomenon observed in PAVR is intriguing but underexplored. Quantitative metrics on pe
1. The paper tackles an important and timely problem - the limited fine-grained perception of MLLMs - which is of clear relevance to the ICLR community. 2. Introduction of a new benchmark (VisuRiddles) focused specifically on AVR is valuable. 3. The benchmark covers diverse aspects of perceptual reasoning (numerosity, attributeness, style, position, spatiality). 4. The paper demonstrates clear limitations of several open and proprietary model families on the benchmark. 5. The proposed method ach
1. Limited novelty of the benchmark: The benchmark largely repackages existing datasets (Chinese National Civil Service Examination, RAVEN, and Sudoku) with limited methodological innovation. Details of dataset construction are missing. For instance: (a) Were questions from the Civil Service dataset used as-is or modified? (b) How were 100 RAVEN questions selected from the 70k instances? (c) A pseudo-code description of the “Synthesis Algorithm” (Fig. 2a) would improve reproducibility.
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
