A Cognitive Paradigm Approach to Probe the Perception-Reasoning Interface in VLMs
Mohit Vaishnav, Tanel Tammet

TL;DR
This paper introduces a structured evaluation framework inspired by cognitive science to analyze how Vision-Language Models integrate perception and reasoning, revealing perception bottlenecks and achieving state-of-the-art results on complex visual reasoning benchmarks.
Contribution
It proposes three novel evaluation paradigms that dissect the perception-reasoning interface in VLMs, including a componential analysis method that isolates reasoning from perception using textual descriptions.
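The idea of isolating reasoning from perception through textual descriptions can be illustrated with a minimal two-stage sketch. Everything named below (`componential_analysis`, `build_prompt`, the prompt wording, and the toy `toy_describe` / `toy_reason` stand-ins) is a hypothetical illustration and not the authors' code, prompts, or models. In practice `describe` would wrap a VLM captioning call and `reason` would wrap a language model; here they are simple stubs so the example runs on its own.

```python
from typing import Callable, Sequence

Describer = Callable[[str], str]  # image id or path -> textual description
Reasoner = Callable[[str], str]   # text prompt -> answer string


def build_prompt(pos: Sequence[str], neg: Sequence[str], query: str) -> str:
    """Assemble a text-only prompt from image descriptions (hypothetical wording)."""
    lines = ["Positive examples:"]
    lines += [f"- {d}" for d in pos]
    lines.append("Negative examples:")
    lines += [f"- {d}" for d in neg]
    lines.append(f"Query: {query}")
    lines.append("Does the query follow the rule shared by the positives?")
    return "\n".join(lines)


def componential_analysis(
    positives: Sequence[str],
    negatives: Sequence[str],
    query: str,
    describe: Describer,
    reason: Reasoner,
) -> str:
    """Two-stage sketch: turn each image into text, then reason over text only."""
    # Stage 1 (perception): each image is described independently, with no
    # knowledge of the task or of the other images.
    pos_desc = [describe(img) for img in positives]
    neg_desc = [describe(img) for img in negatives]
    query_desc = describe(query)
    # Stage 2 (reasoning): the reasoner receives descriptions, never pixels.
    return reason(build_prompt(pos_desc, neg_desc, query_desc))


# --- Toy stand-ins (assumptions, for illustration only) ---
TOY_CAPTIONS = {
    "p1.png": "a dog catching a frisbee",
    "p2.png": "a dog running on grass",
    "n1.png": "a cat sleeping on a sofa",
    "n2.png": "a cat sitting by a window",
    "q.png": "a dog swimming in a lake",
}


def toy_describe(image: str) -> str:
    # Stand-in for a VLM caption: a fixed lookup table.
    return TOY_CAPTIONS[image]


def toy_reason(prompt: str) -> str:
    # Stand-in for an LLM: words common to all positives and absent from
    # every negative act as the "rule"; the query is positive if it has one.
    section, pos, neg, query = None, [], [], set()
    for line in prompt.splitlines():
        if line.startswith("Positive"):
            section = pos
        elif line.startswith("Negative"):
            section = neg
        elif line.startswith("- ") and section is not None:
            section.append(set(line[2:].split()))
        elif line.startswith("Query: "):
            query = set(line[len("Query: "):].split())
    shared = set.intersection(*pos) - set.union(*neg)
    return "positive" if shared & query else "negative"


answer = componential_analysis(
    ["p1.png", "p2.png"], ["n1.png", "n2.png"], "q.png", toy_describe, toy_reason
)
print(answer)
```

Because the two stages only communicate through text, either one can be swapped out without touching the other, which is what makes it possible to attribute an error to perception (a poor description) or to reasoning (a wrong conclusion from a good description).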
Findings
Componential Analysis achieves SOTA on Bongard-OpenWorld, Bongard-HOI, and Winoground.
Decoupling perception from reasoning improves model performance.
Perceptual challenges significantly hinder reasoning capabilities.
Abstract
A fundamental challenge in artificial intelligence involves understanding the cognitive mechanisms underlying visual reasoning in sophisticated models like Vision-Language Models (VLMs). How do these models integrate visual perception with abstract thought, especially when reasoning across multiple images or requiring fine-grained compositional understanding? Drawing inspiration from cognitive science, this paper introduces a structured evaluation framework using diverse visual reasoning tasks, Bongard Problems (BPs) and Winoground, to dissect the perception-reasoning interface in VLMs. We propose three distinct evaluation paradigms, mirroring human problem-solving strategies: Direct Visual Rule Learning (DVRL; holistic processing), Deductive Rule Learning (DRL; rule extraction and application), and Componential Analysis (CA; analytical decomposition via task-agnostic textual…
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning
