Reasoning Limitations of Multimodal Large Language Models. A Case Study of Bongard Problems

Mikołaj Małkiński; Szymon Pawlonka; Jacek Mańdziuk

arXiv:2411.01173·cs.AI·June 24, 2025·2 cites

Open Access

TL;DR

This paper evaluates the reasoning capabilities of multimodal large language models on Bongard Problems, revealing their limitations in synthetic visual reasoning despite some success on real-world datasets.

Contribution

It introduces Bongard-RWR, a new dataset that represents synthetic BP concepts using real-world images, and systematically analyzes MLLMs' reasoning limitations across synthetic and real-world BP datasets.

Findings

01

MLLMs perform poorly on synthetic Bongard Problems

02

Weak performance on classical BPs stems from general abstract visual reasoning limitations rather than domain specificity

03

Some success observed on real-world datasets

Abstract

Abstract visual reasoning (AVR) involves discovering shared concepts across images through analogy, akin to solving IQ test problems. Bongard Problems (BPs) remain a key challenge in AVR, requiring both visual reasoning and verbal description. We investigate whether multimodal large language models (MLLMs) can solve BPs by formulating a set of diverse MLLM-suited solution strategies and testing 4 proprietary and 4 open-access models on 3 BP datasets featuring synthetic (classic BPs) and real-world (Bongard-HOI and Bongard-OpenWorld) images. Despite some successes on real-world datasets, MLLMs struggle with synthetic BPs. To explore this gap, we introduce Bongard-RWR, a dataset representing synthetic BP concepts using real-world images. Our findings suggest that weak MLLM performance on classical BPs is not due to the domain specificity, but rather comes from their general AVR…
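As a rough illustration of the task structure the abstract refers to, a Bongard Problem can be thought of as two groups of panels separated by a single concept: the concept holds for every panel on one side and for none on the other. The sketch below is a toy under stated assumptions, not the paper's evaluation code: panels are hypothetical symbolic feature dictionaries standing in for images, and the names `BongardProblem`, `separates`, `is_triangle`, and `is_filled` are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical symbolic stand-in for an image panel (real BPs use images).
Panel = Dict[str, object]


@dataclass
class BongardProblem:
    left: List[Panel]   # panels that all satisfy the hidden concept
    right: List[Panel]  # panels that all violate it


def separates(rule: Callable[[Panel], bool], bp: BongardProblem) -> bool:
    """A candidate rule solves the problem if it holds for every
    left panel and for no right panel."""
    return all(rule(p) for p in bp.left) and not any(rule(p) for p in bp.right)


# Toy instance: the hidden concept is "the shape is a triangle".
bp = BongardProblem(
    left=[{"shape": "triangle", "filled": f} for f in (True, False, True)],
    right=[{"shape": "circle", "filled": f} for f in (True, False, False)],
)


def is_triangle(p: Panel) -> bool:
    return p["shape"] == "triangle"


def is_filled(p: Panel) -> bool:
    return bool(p["filled"])


print(separates(is_triangle, bp))  # True: holds on the left, never on the right
print(separates(is_filled, bp))    # False: one left panel is not filled
```

In the setting the abstract describes, the difficulty is that a model has to discover such a rule from raw images and state it in natural language, rather than check a rule it is handed.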

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Position-Wise Feed-Forward Layer · Adam · Multi-Head Attention · Residual Connection · Byte Pair Encoding · Dropout · Absolute Position Encodings