TL;DR
This paper introduces a conditional attention framework for sequential visual tasks. It uses a conditional global feature descriptor to focus on the relevant object, achieving state-of-the-art results on SVHN and better scores than soft attention on image captioning.
Contribution
It proposes a new conditional attention mechanism using a global feature descriptor, adaptable with different recurrent structures for various visual tasks.
Findings
Achieves state-of-the-art results on the SVHN dataset.
Scores higher than soft attention on image captioning.
Effective across multiple visual tasks with different recurrent modules.
Abstract
Sequential visual tasks usually require attending to the object of current interest, conditioned on previous observations. Unlike the popular soft attention mechanism, we propose a new attention framework that introduces a novel conditional global feature, which serves as a weak feature descriptor of the currently focused object. Specifically, in a standard CNN (Convolutional Neural Network) pipeline, convolutional layers with different receptive fields produce the attention maps by measuring how well the convolutional features align with the conditional global feature. The conditional global feature can be generated by different recurrent structures depending on the visual task, such as a simple recurrent neural network for multiple-object recognition or a moderately complex language model for image captioning. Experiments show that our proposed conditional…
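As a rough illustration of the alignment step described in the abstract, the sketch below scores each spatial location of a convolutional feature map against a conditional global feature and normalizes the scores into an attention map. The abstract does not say how alignment is measured, so the dot-product score, the softmax normalization, and the function names `attention_map` and `attend` are assumptions made for this example, not the paper's actual formulation.

```python
import numpy as np

def attention_map(conv_features, global_feature):
    """Score each spatial location against the global feature.

    conv_features: array of shape (H, W, C) from one convolutional layer.
    global_feature: array of shape (C,), e.g. produced by a recurrent module.
    Returns an (H, W) map of non-negative weights that sum to 1.
    """
    h, w, c = conv_features.shape
    # Assumed alignment measure: dot product between each local feature
    # vector and the conditional global feature.
    scores = conv_features.reshape(h * w, c) @ global_feature
    # Softmax over all locations (assumed), shifted for numerical stability.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights.reshape(h, w)

def attend(conv_features, global_feature):
    """Return the attention-weighted sum of local features, shape (C,)."""
    amap = attention_map(conv_features, global_feature)
    return np.tensordot(amap, conv_features, axes=([0, 1], [0, 1]))
```

In a sequential setting, a recurrent module would presumably emit a new `global_feature` at each step, so the attention map shifts to a different object as the sequence unfolds. The same function could be applied to several layers with different receptive fields.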
