Multiple Object Recognition with Visual Attention

Jimmy Ba; Volodymyr Mnih; Koray Kavukcuoglu

arXiv:1412.7755 · cs.LG · April 24, 2015 · ICLR · 702 cites

TL;DR

This paper introduces an attention-based deep recurrent neural network, trained with reinforcement learning, that localizes and recognizes multiple objects in images. It outperforms state-of-the-art convolutional networks at transcribing house number sequences from Google Street View images.

Contribution

It presents an attention-based model for multi-object recognition, trained with reinforcement learning to attend to the most relevant image regions, that uses fewer parameters and less computation than state-of-the-art convolutional networks.

Findings

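01 Is more accurate than state-of-the-art convolutional networks at transcribing house number sequences from Google Street View images

02 Learns to both localize and recognize multiple objects given only class labels during training

03 Uses fewer parameters and less computation than those convolutional networks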

Abstract

We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most relevant regions of the input image. We show that the model learns to both localize and recognize multiple objects despite being given only class labels during training. We evaluate the model on the challenging task of transcribing house number sequences from Google Street View images and show that it is both more accurate than the state-of-the-art convolutional networks and uses fewer parameters and less computation.
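To make "trained with reinforcement learning to attend to the most relevant regions" more concrete, here is a minimal sketch of two building blocks such a model could use: cropping a glimpse around a chosen location, and a score-function (REINFORCE) gradient for a stochastic location policy. The abstract does not specify these details, so everything here is an assumption for illustration: the function names `extract_glimpse` and `reinforce_grad_mean` are hypothetical, the glimpse is a hard zero-padded crop, the location policy is Gaussian with fixed standard deviation, and the reward is 1 for a correct prediction. It is not the authors' implementation.

```python
def extract_glimpse(image, center, size):
    """Crop a size x size patch whose top-left corner is center - size // 2.

    `image` is a list of rows; pixels outside the image are zero-padded.
    (Assumed glimpse form: a single hard crop.)
    """
    rows, cols = len(image), len(image[0])
    r0 = center[0] - size // 2
    c0 = center[1] - size // 2
    return [
        [image[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0
         for c in range(c0, c0 + size)]
        for r in range(r0, r0 + size)
    ]


def reinforce_grad_mean(sample, mean, sigma, reward, baseline=0.0):
    """REINFORCE gradient of expected reward w.r.t. the mean of a Gaussian
    location policy, estimated from one sampled coordinate.

    d/d(mean) log N(sample; mean, sigma^2) = (sample - mean) / sigma^2,
    scaled by the baseline-corrected reward.
    """
    return (reward - baseline) * (sample - mean) / (sigma ** 2)


image = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 2.0, 0.0],
    [0.0, 3.0, 4.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(extract_glimpse(image, center=(2, 2), size=2))  # [[1.0, 2.0], [3.0, 4.0]]

# A sampled location to the right of the mean that earned reward 1
# pushes the mean toward that sample (positive gradient).
print(reinforce_grad_mean(sample=3.0, mean=2.0, sigma=1.0, reward=1.0))  # 1.0
```

In a setup like this, only the class label is needed to compute the reward, which is consistent with the abstract's claim that localization is learned without location supervision. The recurrent network that would consume the glimpses and emit the location mean is omitted from the sketch.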

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning