Bongard-RWR+: Real-World Representations of Fine-Grained Concepts in Bongard Problems
Szymon Pawlonka, Mikołaj Małkiński, Jacek Mańdziuk

TL;DR
This paper introduces Bongard-RWR+, a large dataset of real-world-like images for abstract visual reasoning, and evaluates vision-language models, revealing their limitations in recognizing fine-grained concepts.
Contribution
The creation of Bongard-RWR+, a significantly expanded dataset using a VLM pipeline, and an evaluation of VLMs' ability to handle fine-grained abstract concepts in Bongard problems.
Findings
VLMs recognize coarse-grained concepts effectively.
VLMs struggle with fine-grained concept discrimination.
Limitations in VLM reasoning capabilities are highlighted.
Abstract
Bongard Problems (BPs) provide a challenging testbed for abstract visual reasoning (AVR), requiring models to identify visual concepts from just a few examples and describe them in natural language. Early BP benchmarks featured synthetic black-and-white drawings, which might not fully capture the complexity of real-world scenes. Subsequent BP datasets employed real-world images, albeit the represented concepts are identifiable from high-level image features, reducing the task complexity. Differently, the recently released Bongard-RWR dataset aimed at representing abstract concepts formulated in the original BPs using fine-grained real-world images. Its manual construction, however, limited the dataset size to just 60 instances, constraining evaluation robustness. In this work, we introduce Bongard-RWR+, a BP dataset composed of 5,400 instances that represent original BP abstract…
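To make the task format concrete, here is a minimal sketch of how a Bongard Problem instance and a binary side-assignment score could be represented. It is not the authors' code: the `BongardProblem` class, the `side_assignment_accuracy` helper, the string image identifiers, and the toy classifier are all hypothetical stand-ins, and a real evaluation would call a vision-language model on actual images.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class BongardProblem:
    """Hypothetical container: two sets of images separated by one concept."""
    left: Tuple[str, ...]    # identifiers of images that show the left-side concept
    right: Tuple[str, ...]   # identifiers of images that show the opposite concept
    left_concept: str        # natural-language description of the left side
    right_concept: str       # natural-language description of the right side


# A classifier receives the problem and a held-out image, and answers "left" or "right".
Classifier = Callable[[BongardProblem, str], str]
Case = Tuple[BongardProblem, str, str]  # (problem, held-out image, true side)


def side_assignment_accuracy(cases: Iterable[Case], classify: Classifier) -> float:
    """Fraction of held-out images assigned to their correct side."""
    cases = list(cases)
    if not cases:
        raise ValueError("at least one test case is required")
    correct = sum(1 for problem, image, side in cases
                  if classify(problem, image) == side)
    return correct / len(cases)


def toy_classifier(problem: BongardProblem, image: str) -> str:
    """Stand-in for a model: votes by matching the size prefix of the identifier."""
    size = image.split("_")[0]
    left_votes = sum(img.split("_")[0] == size for img in problem.left)
    right_votes = sum(img.split("_")[0] == size for img in problem.right)
    return "left" if left_votes >= right_votes else "right"


problem = BongardProblem(
    left=("large_circle", "large_square"),
    right=("small_circle", "small_square"),
    left_concept="large figures",
    right_concept="small figures",
)
cases = [(problem, "large_triangle", "left"), (problem, "small_triangle", "right")]
accuracy = side_assignment_accuracy(cases, toy_classifier)
print(f"side-assignment accuracy: {accuracy:.2f}")
```

Keeping the classifier as a plain callable means the same scoring function could wrap any model; only the toy voting rule would need to be replaced.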
Peer Reviews
Decision·ICLR 2026 Poster
- Clever use of I2T/T2T/T2I with human vetting yields large, diverse matrices that preserve fine-grained Bongard concepts rather than only coarse, object-level cues.
- Binary/paired side assignments, concept selection, and free-form concept generation give a multifaceted picture of AVR.
- Grayscale shows color is a distractor; more demonstrations help certain models; generated vs. real yields similar trends, supporting external validity.
- Per-concept-group breakdowns (size/shape vs. conto
- Caption/augmentation quality inherits biases and blind spots of the I2T/T2T models; although human verification mitigates this, a quantitative *masking* or *counterfactual* stress test of pipeline robustness would help.
- BLEU/ROUGE/CIDEr/BERTScore only weakly capture conceptual correctness and fine-grained relations; a concept-aware rubric (or human evaluation on a subset) would better reflect success/failure in CG.
- Most core experiments use a small pool of open VLMs; including stronger
1. This paper introduces Bongard-RWR+, an innovative extension of the Bongard Problems, using a semi-automated pipeline with vision-language models (VLMs) to generate large-scale, fine-grained real-world images, addressing the limitations of manual datasets like Bongard-RWR.
2. The methodology is solid, ensuring image diversity and offering valuable insights through ablations on model size, color, and image diversity.
3. The paper is clear and well-structured, with effective visuals. Its signifi
1. The paper lacks a detailed exploration of how the semi-automated pipeline can be further refined for more reliable image generation. Manual filtering still plays a key role, and automated verification could improve scalability and reduce bias.
2. Although the experiments cover multiple tasks, the evaluation of fine-grained reasoning is limited. It would benefit from including more diverse models or comparing performance with human-level reasoning.
3. Lastly, a deeper analysis of errors, especial
The paper addresses a relevant problem by scaling up the Bongard-RWR dataset from 60 to 5,400 instances through a semi-automated generation pipeline, which represents a reasonable engineering contribution to the abstract visual reasoning benchmark landscape. The experimental evaluation is comprehensive, covering multiple task formulations (binary classification, multiclass selection, text generation) and including useful ablation studies on factors like model size, color versus grayscale, and nu
The paper's core limitation is that it provides dataset scaling rather than methodological innovation. The reliance on generated images raises validity concerns, especially given significant demographic bias (79.9% White figures) and the exclusion of a number of original concepts due to generation failures, suggesting the approach cannot capture full abstract reasoning complexity. The experimental analysis is shallow: it confirms known VLM limitations without investigating why models fail, lacks de
Taxonomy
Topics: Constraint Satisfaction and Optimization · Machine Learning and Data Classification · AI-based Problem Solving and Planning
