Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition
Hossein Adeli, Seoyoung Ahn, Gregory Zelinsky

TL;DR
This paper introduces a recurrent attention model using object-centric capsule representations that integrates attention and recognition for multi-object recognition tasks. The model learns to move its glimpse window and to recognize and reconstruct objects with only classification labels as supervision.
Contribution
It presents a novel encoder-decoder model combining capsule networks with iterative glimpse attention for integrated object recognition and scene understanding.
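The capsule side of such a model can be illustrated with routing by agreement, the procedure from the original capsule network of Sabour et al. (2017). The sketch below assumes that standard formulation. The paper's exact capsule layer, number of routing iterations, and dimensions may differ, and the function names `squash`, `softmax`, and `route` are hypothetical. Each lower capsule sends a prediction vector to every upper capsule, and coupling shifts toward upper capsules whose output agrees with those predictions.

```python
import math

def squash(v):
    """Capsule nonlinearity: keeps the direction of v and maps its length into [0, 1)."""
    sq = sum(x * x for x in v)
    norm = math.sqrt(sq)
    if norm == 0.0:
        return [0.0 for _ in v]
    scale = sq / (1.0 + sq) / norm
    return [scale * x for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """Routing by agreement (assumed standard variant, not necessarily the paper's).

    u_hat[i][j] is the prediction vector from lower capsule i for upper capsule j.
    Returns the list of upper-capsule output vectors v[j].
    """
    n_lower = len(u_hat)
    n_upper = len(u_hat[0])
    dim = len(u_hat[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]  # routing logits, start uniform
    v = []
    for _ in range(iterations):
        # Coupling coefficients: each lower capsule splits its vote over upper capsules.
        c = [softmax(row) for row in b]
        v = []
        for j in range(n_upper):
            # Weighted sum of predictions for upper capsule j, then squash.
            s = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_lower)) for d in range(dim)]
            v.append(squash(s))
        # Raise logits where a prediction agrees (dot product) with the upper output.
        for i in range(n_lower):
            for j in range(n_upper):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v

# Two lower capsules agree on upper capsule 0 and contradict each other on capsule 1.
outputs = route([[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, -1.0]]])
print([round(math.hypot(*vj), 3) for vj in outputs])
```

In this toy call the contradictory predictions for upper capsule 1 cancel, so routing moves coupling toward capsule 0 and its output vector grows longer over the iterations. The length of an output vector is what a capsule model typically reads as the presence of the entity it represents.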
Findings
Effective recognition of overlapping objects and of objects in cluttered scenes
Model learns to move its glimpse window adaptively
Achieves high accuracy with only classification supervision
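One common way for a capsule model to learn from classification labels alone is the margin loss of Sabour et al. (2017). Here the length of each class capsule acts as that class's presence score, and several classes can be present at once, which suits multi-object recognition. The sketch below assumes this loss with its usual constants (m+ = 0.9, m- = 0.1, lambda = 0.5). The paper's exact objective may differ, and `margin_loss` is a hypothetical name.

```python
def margin_loss(capsule_lengths, present, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Capsule margin loss (assumed Sabour et al. 2017 form).

    capsule_lengths[k] is the length of class capsule k, in [0, 1).
    present is the set of class indices that appear in the image (multi-label).
    """
    total = 0.0
    for k, length in enumerate(capsule_lengths):
        if k in present:
            # Present classes are pushed to have long capsules (>= m_pos).
            total += max(0.0, m_pos - length) ** 2
        else:
            # Absent classes are pushed to have short capsules (<= m_neg), down-weighted by lam.
            total += lam * max(0.0, length - m_neg) ** 2
    return total

print(margin_loss([0.95, 0.05, 0.92], {0, 2}))  # both digits detected, no penalty
print(margin_loss([0.5, 0.5], {0}))
```

Because each class is scored independently, an image containing two digits simply marks both as present. No per-object location or segmentation label is needed for this term.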
Abstract
The visual system processes a scene using a sequence of selective glimpses, each driven by spatial and object-based attention. These glimpses reflect what is relevant to the ongoing task and are selected through recurrent processing and recognition of the objects in the scene. In contrast, most models treat attention selection and recognition as separate stages in a feedforward process. Here we show that using capsule networks to create an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition. We evaluate our model on three multi-object recognition tasks (highly overlapping digits, digits among distracting clutter, and house numbers) and show that it learns to effectively move its glimpse window and to recognize and reconstruct the objects, all with only classification as supervision. Our…
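The iterative glimpse idea, where each step looks at a small window and the next window location depends on what was just seen, can be sketched as a simple loop. This is a minimal illustration under stated assumptions: the glimpse is a hard square crop with zero padding, and `greedy_policy` is a hypothetical toy heuristic standing in for the paper's learned recurrent attention. The names `extract_glimpse`, `run_glimpses`, and `greedy_policy` are all hypothetical, and the paper's actual glimpse mechanism may differ.

```python
def extract_glimpse(image, center, size):
    """Crop a size x size window centred on (row, col); pixels outside the image are 0."""
    r0 = center[0] - size // 2
    c0 = center[1] - size // 2
    h, w = len(image), len(image[0])
    return [
        [image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
         for c in range(c0, c0 + size)]
        for r in range(r0, r0 + size)
    ]

def run_glimpses(image, start, size, n_steps, policy):
    """Take n_steps glimpses. `policy` (hypothetical stand-in for learned attention)
    maps (glimpse, center, state) -> (next_center, new_state)."""
    center, state, glimpses = start, None, []
    for _ in range(n_steps):
        g = extract_glimpse(image, center, size)
        glimpses.append((center, g))
        center, state = policy(g, center, state)
    return glimpses

def greedy_policy(glimpse, center, state):
    """Toy policy: recentre the window on the brightest pixel of the current glimpse."""
    size = len(glimpse)
    _, i, j = max((glimpse[i][j], i, j) for i in range(size) for j in range(size))
    dr, dc = i - size // 2, j - size // 2
    return (center[0] + dr, center[1] + dc), state

# Toy image whose intensity increases toward the bottom-right corner.
img = [[float(r + c) for c in range(5)] for r in range(5)]
for center, _ in run_glimpses(img, (0, 0), 3, 4, greedy_policy):
    print(center)
```

The `state` argument is threaded through the loop to mark where a recurrent hidden state would live. In a trained model that state, rather than a fixed brightness rule, would decide where to look next.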
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
