Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems
Mikołaj Małkiński, Szymon Pawlonka, Jacek Mańdziuk

TL;DR
This paper evaluates multimodal large language models (MLLMs) on Bongard Problems (BPs) and finds that they struggle with synthetic BPs despite some success on real-world datasets.
Contribution
It introduces Bongard-RWR, a new dataset that represents synthetic BP concepts using real-world images, and systematically analyzes MLLMs' reasoning limitations across synthetic and real-world BP datasets.
Findings
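MLLMs struggle with synthetic (classic) Bongard Problems
Weak performance on classic BPs stems from general abstract visual reasoning limitations rather than domain specificity
Some success is observed on real-world datasets (Bongard HOI and Bongard-OpenWorld)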
Abstract
Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing proprietary and open-access models on BP datasets featuring synthetic (classic BPs) and real-world (Bongard HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classical BPs is not due to the domain specificity, but rather comes from their general AVR…
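For readers unfamiliar with the task format, the following is a minimal sketch of what "solving" a Bongard Problem means: the panels on one side share a concept that no panel on the other side has. It is not the paper's evaluation code. The attribute dictionaries standing in for images, and the names `BongardProblem` and `rule_solves`, are hypothetical and chosen only for illustration; the models studied in the paper work on actual images and answer in natural language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical stand-in for an image: a dict of hand-labelled attributes.
Panel = Dict[str, object]


@dataclass
class BongardProblem:
    left: List[Panel]   # panels that share the hidden concept
    right: List[Panel]  # panels that do not


def rule_solves(problem: BongardProblem, rule: Callable[[Panel], bool]) -> bool:
    """A candidate rule solves the problem if it holds for every
    left panel and for no right panel."""
    return all(rule(p) for p in problem.left) and not any(
        rule(p) for p in problem.right
    )


# Toy problem (invented data): the hidden concept is "filled shape".
toy = BongardProblem(
    left=[
        {"shape": "triangle", "filled": True},
        {"shape": "circle", "filled": True},
    ],
    right=[
        {"shape": "triangle", "filled": False},
        {"shape": "square", "filled": False},
    ],
)


def is_filled(panel: Panel) -> bool:
    return panel["filled"] is True


def is_triangle(panel: Panel) -> bool:
    return panel["shape"] == "triangle"


print(rule_solves(toy, is_filled))    # True: separates the two sides
print(rule_solves(toy, is_triangle))  # False: a left panel is a circle
```

The sketch only checks a proposed rule against labelled attributes; the hard part that the paper studies, discovering the concept from raw images and describing it verbally, is deliberately left out.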
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings
