Aligning where to see and what to tell: image caption with region-based attention and scene factorization

Junqi Jin; Kun Fu; Runpeng Cui; Fei Sha; Changshui Zhang

arXiv:1506.06272 · cs.CV · June 23, 2015 · 107 cites

TL;DR

This paper introduces an image captioning system that aligns visual regions with sentence structure and incorporates scene-specific contexts, achieving state-of-the-art results by combining attention mechanisms and semantic scene modeling.

Contribution

The paper presents an image captioning approach that aligns shifts of visual attention across image regions with the word-by-word generation of the sentence, and that uses scene-specific contexts to adapt the language model to the kind of scene shown.

Findings

01 Region-based attention improves caption accuracy.

02 Scene-specific contexts enhance semantic relevance.

03 Combining both methods achieves state-of-the-art performance.

Abstract

Recent progress on automatic generation of image captions has shown that it is possible to describe the most salient information conveyed by images with accurate and meaningful sentences. In this paper, we propose an image caption system that exploits the parallel structures between images and sentences. In our model, the process of generating the next word, given the previously generated ones, is aligned with the visual perception experience where the attention shifting among the visual regions imposes a thread of visual ordering. This alignment characterizes the flow of "abstract meaning", encoding what is semantically shared by both the visual scene and the text description. Our system also makes another novel modeling contribution by introducing scene-specific contexts that capture higher-level semantic information encoded in an image. The contexts adapt language models for word…
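To make the idea of attention shifting among visual regions concrete, here is a minimal sketch of generic soft attention over region features. It is not the authors' model: the dot-product scoring, the function names `softmax` and `attend`, and the toy feature values are all assumptions chosen for illustration. At each word-generation step, a query vector (standing in for the language model's state) scores every region, and the scores are normalized into weights that pick out where to "look".

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(regions, query):
    # Hypothetical scoring rule: dot product between each region feature
    # vector and the query vector (an assumption, not the paper's scoring).
    scores = [sum(r * q for r, q in zip(region, query)) for region in regions]
    weights = softmax(scores)
    # Context vector: attention-weighted average of the region features.
    dim = len(regions[0])
    context = [
        sum(w * region[d] for w, region in zip(weights, regions))
        for d in range(dim)
    ]
    return weights, context

# Toy example: three regions with 2-dimensional features (made-up values).
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
weights, context = attend(regions, query)
```

In a captioning loop, `query` would change after every generated word, so the weights, and therefore the attended regions, would shift from step to step.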

Code & Models

Repositories

fukun07/neural-image-captioning

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization