Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies
Ayushman Sarkar, Mohd Yamani Idna Idris, Zhenyu Yu

TL;DR
This survey categorizes and analyzes five types of visual reasoning in computer vision (relational, symbolic, temporal, causal, and commonsense), examining models, tasks, evaluation protocols, and open challenges to guide future research toward more integrated and trustworthy AI systems.
Contribution
It provides a comprehensive taxonomy of visual reasoning types, reviews diverse models and evaluation methods, and identifies key open challenges for advancing the field.
Findings
Relational, symbolic, temporal, causal, and commonsense reasoning are systematically categorized.
Evaluation protocols are critically analyzed for their limitations in generalizability and reproducibility.
Open challenges include scalability, integration of paradigms, and lack of comprehensive benchmarks.
Abstract
Visual reasoning is critical for a wide range of computer vision tasks that go beyond surface-level object detection and classification. Despite notable advances in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys often address these directions in isolation, lacking a unified analysis and comparison across reasoning types, methodologies, and evaluation protocols. This survey aims to address this gap by categorizing visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and systematically examining their implementation through architectures such as graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems. We review evaluation protocols designed to assess functional correctness, structural consistency, and causal validity, and critically analyze their limitations in terms of…
