KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models
Eunice Yiu, Maan Qraitem, Anisa Noor Majhi, Charlie Wong, Yutong Bai, Shiry Ginosar, Alison Gopnik, Kate Saenko

TL;DR
This study introduces a new benchmark inspired by developmental psychology to evaluate large multimodal models' ability to perform visual analogical reasoning, revealing their limitations compared to children and adults.
Contribution
We propose a novel benchmark of 4,300 visual transformations to assess LMMs on analogical reasoning, highlighting their struggles with complex rules compared to human children and adults.
Findings
Models excel at identifying visual changes but struggle with applying and extrapolating rules.
Children and adults outperform models in all stages of analogical reasoning.
Complex tasks involving spatial understanding are particularly challenging for models.
Abstract
This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A "visual analogy" is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the "what" effectively, they struggle with…
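The three-stage evaluation described in the abstract can be illustrated with a small scoring sketch. This is not the authors' evaluation code: the `Trial` structure, the stage labels `what`, `how`, and `apply`, and the answer options are all hypothetical names chosen for illustration. It assumes each item carries one gold answer per stage, and reports accuracy separately for each stage.

```python
from dataclasses import dataclass

# Hypothetical labels for the three stages named in the abstract:
# what changed, how it changed, and applying the rule to a new object.
STAGES = ("what", "how", "apply")


@dataclass
class Trial:
    """One illustrative benchmark item (structure assumed, not from the paper)."""
    gold: dict  # stage -> correct option label
    pred: dict  # stage -> option chosen by the model or participant


def stage_accuracy(trials):
    """Return the fraction of correct answers for each stage."""
    totals = {s: 0 for s in STAGES}
    correct = {s: 0 for s in STAGES}
    for t in trials:
        for s in STAGES:
            if s in t.gold:
                totals[s] += 1
                correct[s] += int(t.pred.get(s) == t.gold[s])
    # Stages with no scored trials report 0.0 rather than dividing by zero.
    return {s: (correct[s] / totals[s] if totals[s] else 0.0) for s in STAGES}


# Made-up responses: both trials get "what" right, but each misses one later stage.
trials = [
    Trial(gold={"what": "number", "how": "+1", "apply": "B"},
          pred={"what": "number", "how": "+1", "apply": "A"}),
    Trial(gold={"what": "color", "how": "red->blue", "apply": "C"},
          pred={"what": "color", "how": "red->green", "apply": "C"}),
]
print(stage_accuracy(trials))  # {'what': 1.0, 'how': 0.5, 'apply': 0.5}
```

Scoring each stage separately is what lets a pattern like the one reported in the findings show up: high accuracy on "what" alongside lower accuracy on the later stages.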
Peer Reviews
Decision: ICLR 2025 Poster
1. The dataset, inspired by developmental psychology, is unique in its simplicity, enabling assessments that even young children can complete. Its three-stage structure offers a clear breakdown of different analogical reasoning abilities in LMMs versus humans.
2. Extensive experimentation demonstrates specific strengths and weaknesses of LMMs, providing critical insights. For example, while models can recognize "what" changed in an image, they struggle to quantify "how" it changed and to gen
1. The selection of visual analogy domains, while simple and fundamental, lacks sufficient justification regarding why these specific transformations were chosen over others. Intuitively, additional characteristics—such as edibility, danger, sharpness, and liveliness—are also essential features humans consider. For more complex natural scenes, it’s unclear whether the selected features are more significant than others. The authors can provide further rationale for choosing these five factors or
- Breaking down visual analogies into these reasoning steps helps to highlight exactly where humans and models fail
- Presenting both adult and child data on the benchmark questions is valuable. The human studies appear well conducted.
- Various additional common steps are evaluated to improve model performance, which interestingly do not seem to change model performance greatly.
- Examination of model response consistency helps to unpack where model decisions go wrong.
1. The rotation task does not appear to assess 3D rotation, which is the main focus of studies of mental rotation from cognitive psychology. As far as I can tell, these rotation tasks could in principle be solved by rotating the image plane (e.g. pixels, monitor, or the participant's head). Since you have 3D objects, why not add real 3D rotations (where a "hidden" part of the object due to self-occlusion is revealed)? This task would further strengthen the challenge of the benchmark.
2. Addition
The writing is clear and well-structured. Figures also clearly demonstrate the test tasks and their results. The authors introduce a novel and well-motivated benchmark for studying LMM capabilities. The test is grounded by using real-world objects, and draws inspiration from developmental psychology. The three stages introduced by the authors help clarify where LMMs have shortcomings. The experimental design is rigorous, and validated with human studies of both children and adults. The analysis
The discussion could be expanded to address why models tend to fail at certain transformations, beyond investigating their consistency. While the paper mentions objects were "handpicked by developmental psychology experts", it doesn't detail the selection criteria or validation process. There's no reported validation that the transformations are equally discriminable across categories. For instance, an example image shows a die face with five dots - almost completely symmetric under
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Speech and dialogue systems
