Image Captioning with Semantic Attention
Quanzeng You, Hailin Jin, Zhaowen Wang, Chen Fang, Jiebo Luo

TL;DR
This paper introduces a semantic attention model for image captioning that combines top-down and bottom-up approaches; the authors report improvements over prior methods on the Microsoft COCO and Flickr30K benchmarks.
Contribution
The paper proposes a semantic attention mechanism that selects among detected semantic concepts and fuses them into the hidden states and outputs of a recurrent neural network, linking top-down and bottom-up captioning.
Findings
Outperforms state-of-the-art methods on COCO and Flickr30K datasets
Improves scores across the evaluation metrics reported on both benchmarks
Demonstrates effective integration of semantic concepts in caption generation
Abstract
Automatically generating a natural language description of an image has attracted interest recently, both because of its importance in practical applications and because it connects two major artificial intelligence fields: computer vision and natural language processing. Existing approaches are either top-down, which start from a gist of an image and convert it into words, or bottom-up, which come up with words describing various aspects of an image and then combine them. In this paper, we propose a new algorithm that combines both approaches through a model of semantic attention. Our algorithm learns to selectively attend to semantic concept proposals and fuse them into hidden states and outputs of recurrent neural networks. The selection and fusion form a feedback connecting the top-down and bottom-up computation. We evaluate our algorithm on two public benchmarks: Microsoft COCO and…
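The selection-and-fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a bilinear score between a query vector (standing in for the top-down signal, such as the previous word's embedding) and each concept embedding (the bottom-up proposals), a softmax over those scores, and a plain additive fusion. The names `attend_to_concepts`, `fuse`, and `bilinear`, the toy concepts, and the identity scoring matrix are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_to_concepts(query, concept_vecs, bilinear):
    """Return (weights, context) for one decoding step.

    query        : vector carrying the top-down signal (assumed: previous word)
    concept_vecs : embeddings of detected concepts (bottom-up proposals)
    bilinear     : hypothetical d x d scoring matrix; learned in a real model
    """
    # Bilinear relevance score of each concept with respect to the query.
    scores = [dot(query, matvec(bilinear, c)) for c in concept_vecs]
    weights = softmax(scores)
    # Attention-weighted sum of the concept embeddings.
    dim = len(query)
    context = [
        sum(w * c[j] for w, c in zip(weights, concept_vecs)) for j in range(dim)
    ]
    return weights, context


def fuse(query, context):
    """Simplified fusion (an assumption): add the attended context to the query."""
    return [q + c for q, c in zip(query, context)]


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    concepts = ["dog", "frisbee", "grass"]  # toy concept proposals
    concept_vecs = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in concepts]
    # Identity scoring matrix as a stand-in for learned parameters.
    bilinear = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    query = concept_vecs[0]  # pretend the previous word was "dog"
    weights, context = attend_to_concepts(query, concept_vecs, bilinear)
    fused = fuse(query, context)  # would feed the next recurrent step
    for name, w in zip(concepts, weights):
        print(f"{name}: {w:.3f}")
```

In a trained model the scoring matrix and embeddings would be learned, and the fused vector would feed the recurrent network at each step, so that the words generated so far influence which concepts receive attention next.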
Videos
Image Captioning With Semantic Attention · YouTube
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
