What Makes a Maze Look Like a Maze?
Joy Hsu, Jiayuan Mao, Joshua B. Tenenbaum, Noah D. Goodman, Jiajun Wu

TL;DR
This paper introduces Deep Schema Grounding (DSG), a framework that enhances vision-language models' ability to interpret and reason about abstract visual concepts like mazes by grounding schemas in images.
Contribution
The paper proposes DSG, a novel method that uses schemas and large language models to improve understanding of visual abstractions in images.
Findings
DSG improves abstract visual reasoning performance; one reviewer cites a 10% relative improvement with GPT-4o and 7% with LLaVA.
Grounded schemas enhance understanding of complex visual concepts.
The approach outperforms existing models on the Visual Abstractions Dataset.
Abstract
A unique aspect of human visual understanding is the ability to flexibly interpret abstract concepts: acquiring lifted rules explaining what they symbolize, grounding them across familiar and unfamiliar contexts, and making predictions or reasoning about them. While off-the-shelf vision-language models excel at making literal interpretations of images (e.g., recognizing object categories such as tree branches), they still struggle to make sense of such visual abstractions (e.g., how an arrangement of tree branches may form the walls of a maze). To address this challenge, we introduce Deep Schema Grounding (DSG), a framework that leverages explicit structured representations of visual abstractions for grounding and reasoning. At the core of DSG are schemas--dependency graph descriptions of abstract concepts that decompose them into more primitive-level symbols. DSG uses large language…
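To make the idea of a schema as a dependency graph concrete, here is a minimal sketch, not the authors' implementation. The maze symbols (`layout`, `walls`, `entry_exit`) follow the maze example mentioned in the reviews, but the dependency edges, the function names, and the `ground_symbol` callback (a stand-in for a vision-language model query) are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical maze schema: each symbol maps to the symbols it depends on.
# The edges are an illustrative assumption, not taken from the paper.
MAZE_SCHEMA = {
    "layout": set(),
    "walls": {"layout"},
    "entry_exit": {"layout", "walls"},
}


def grounding_order(schema):
    """Return the symbols so that each one comes after its dependencies."""
    return list(TopologicalSorter(schema).static_order())


def ground_schema(schema, ground_symbol):
    """Ground every symbol in dependency order.

    ground_symbol(symbol, context) is a caller-supplied stand-in for a
    VLM query; context holds the groundings of the symbol's dependencies.
    """
    grounded = {}
    for symbol in grounding_order(schema):
        context = {dep: grounded[dep] for dep in sorted(schema[symbol])}
        grounded[symbol] = ground_symbol(symbol, context)
    return grounded
```

Passing a stub such as `lambda symbol, context: f"{symbol} given {sorted(context)}"` as `ground_symbol` is enough to see the order in which symbols would be resolved; a real system would query a model at that point.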
Peer Reviews
Decision: ICLR 2025 Poster
- Innovative use of pre-trained LLMs and VLMs to give additional context (in terms of grounded schemas) for VQA. In principle, this additional information should help pre-trained VLMs better answer questions related to the abstract concepts in the image. (However, I think that the generated schemas shown in the appendix of the paper are not detailed enough to achieve this; please check the weakness section.)
- Hierarchical Grounding: I like how they have used a hierarchical method of g
- Schemas are not detailed enough to give information about the concept. For example, the tic-tac-toe schema includes {board, symbols, strategy}. Although a tic-tac-toe game has {board, symbols, strategy}, the schema is not complete: many board games have {board, symbols, strategy}. This schema does not say that the board is a 3x3 grid. Similarly, for “negotiating” the schema is {participants, setting, object}. This schema could describe many different settings. (P.S. I do understand that the capabilities of c
The paper provides a good solution for VLMs to better understand abstract concepts in an image. The presentation of the paper is clear with many figures for demonstration. The experiments are comprehensive, making the main message convincing. The benchmark created is novel and interesting.
In terms of the idea behind DSG, it seems to be close to chain-of-thought prompting with specific instructions. For example, the maze example in the paper can be handled with just one prompt: "Imagine that the image represents a maze. <the question> Think step by step by recognizing the layout of the maze, the walls of the maze, then the entry and exit of the maze one by one." Maybe GPT-4o can automatically do this even without the instructions. This is probably why in Table 7 if the generation is free
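The reviewer's comparison can be sketched in code. The first function assembles the single chain-of-thought prompt quoted in the review from a concept, its components, and a question. The second builds one query per component as a rough contrast. The function names and the wording of the per-component queries are hypothetical and are not taken from the paper.

```python
def join_steps(concept, components):
    """Phrase the components as 'the X of the C, ..., then the Z of the C'."""
    if not components:
        raise ValueError("at least one component is required")
    parts = [f"the {c} of the {concept}" for c in components]
    if len(parts) == 1:
        return parts[0]
    return ", ".join(parts[:-1]) + ", then " + parts[-1]


def single_prompt(concept, components, question):
    """Build the one-shot prompt in the form quoted by the reviewer.

    The article 'a' is hard-coded, so this only reads well for concepts
    that start with a consonant sound.
    """
    return (
        f"Imagine that the image represents a {concept}. {question} "
        f"Think step by step by recognizing "
        f"{join_steps(concept, components)} one by one."
    )


def stepwise_prompts(concept, components, question):
    """Hypothetical contrast: one query per component, then the question."""
    prompts = [
        f"Imagine that the image represents a {concept}. "
        f"Where is the {c} of the {concept}?"
        for c in components
    ]
    prompts.append(question)
    return prompts
```

Calling `single_prompt("maze", ["layout", "walls", "entry and exit"], question)` reproduces the reviewer's prompt with the question filled in, while `stepwise_prompts` makes the cost of the alternative visible: one model call per component plus one for the final question.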
This paper is extremely clear and straightforward. In general, I believe the comparisons are fair and sufficient ablations are provided, so it is methodologically sound. For example, they show a 10% relative improvement when applying their method with GPT-4o and 7% with (the open-source and weaker) LLaVA model. They also compare against other methods (ViperGPT and VisProg) that also use GPT models (GPT-3) for visual question decomposition, but don't perform nearly as well on this benchmark. Fin
I believe the construction of the dataset is under-specified (and I also checked the appendix). I understand that answer annotations were provided by crowd workers. However, how were the questions written? How were the images selected (and where do they come from: L376 just says "the Internet")? Likewise for the 12 abstract concepts and their 4 categories. Right now, I am imagining that the authors curated these themselves. That could be fine, but I would like to know more details, so that we ca
Taxonomy
Topics: Jungian Analytical Psychology · Architecture and Computational Design · Architecture and Cultural Influences
