PhD Thesis: Exploring the role of (self-)attention in cognitive and computer vision architecture
Mohit Vaishnav

TL;DR
This thesis explores how self-attention and memory mechanisms improve reasoning in vision tasks, proposing new models and architectures that enhance performance, efficiency, and generalization in complex visual reasoning.
Contribution
It introduces a refined taxonomy of reasoning tasks, extends Transformer models with memory, and proposes GAMR, a cognitive architecture that outperforms other architectures in sample efficiency, robustness, and compositionality.
Findings
Self-attention enhances feature extraction in visual reasoning.
GAMR outperforms other models in sample efficiency and robustness.
GAMR achieves zero-shot generalization on new reasoning tasks.
Abstract
We investigate the role of attention and memory in complex reasoning tasks. We analyze Transformer-based self-attention as a model and extend it with memory. By studying the Synthetic Visual Reasoning Test (SVRT), we refine the taxonomy of reasoning tasks. By adding self-attention to ResNet50, we enhance its feature maps with feature-based and spatial attention and solve challenging visual reasoning tasks efficiently. Our findings contribute to understanding the attentional needs of SVRT tasks. Additionally, we propose GAMR, a cognitive architecture combining attention and memory, inspired by active vision theory. GAMR outperforms other architectures in sample efficiency, robustness, and compositionality, and shows zero-shot generalization on new reasoning tasks.
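As a rough illustration of what applying self-attention to a convolutional feature map can look like, the sketch below runs single-head scaled dot-product attention over the spatial positions of a feature map. It is not the thesis's implementation: the function names, the projection matrices `w_q`, `w_k`, `w_v`, the residual connection, and the 7x7x64 shape are all assumptions chosen for illustration, and random values stand in for real ResNet50 features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attend_feature_map(fmap, w_q, w_k, w_v):
    """Single-head self-attention over spatial positions (hypothetical helper).

    fmap: array of shape (H, W, C), e.g. a backbone feature map.
    w_q, w_k: projection matrices of shape (C, D).
    w_v: projection matrix of shape (C, C) so the residual sum is valid.
    Returns the attended map (H, W, C) and the attention matrix (H*W, H*W).
    """
    h, w, c = fmap.shape
    tokens = fmap.reshape(h * w, c)          # one token per spatial location
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product similarities
    attn = softmax(scores, axis=-1)          # each row sums to 1
    out = tokens + attn @ v                  # residual connection (assumed)
    return out.reshape(h, w, c), attn


# Usage with random stand-in data (not real ResNet50 activations).
rng = np.random.default_rng(0)
fmap = rng.standard_normal((7, 7, 64))
w_q = rng.standard_normal((64, 32)) * 0.1
w_k = rng.standard_normal((64, 32)) * 0.1
w_v = rng.standard_normal((64, 64)) * 0.1
attended, attn = self_attend_feature_map(fmap, w_q, w_k, w_v)
```

Each row of `attn` indicates how strongly one spatial location draws on every other location, which is the sense in which attention can act spatially; attention over feature channels would need a different formulation that this sketch does not cover.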
Taxonomy
Topics: Visual Attention and Saliency Detection
