Guiding Attention using Partial-Order Relationships for Image Captioning

Murad Popattia; Muhammad Rafi; Rizwan Qureshi; Shah Nawaz

arXiv:2204.07476 · cs.CV · April 18, 2022 · 1 cites

Open Access

TL;DR

This paper introduces a guided attention network for image captioning that leverages partial-order relationships between visual features, topics, and captions in a shared embedding space to improve caption accuracy.

Contribution

It proposes a novel guided attention mechanism using a partial-order embedding space trained with a pairwise ranking objective for better image captioning.

Findings

01 Achieves competitive results on the MSCOCO dataset

02 Outperforms several state-of-the-art models on multiple metrics

03 Demonstrates the effectiveness of partial-order relationships in attention models

Abstract

The use of attention models for automated image captioning has enabled many systems to produce accurate and meaningful descriptions of images. Over the years, many novel approaches have been proposed to enhance the attention process using different feature representations. In this paper, we extend this approach by creating a guided attention network mechanism that exploits the relationship between the visual scene and text descriptions using spatial features from the image, high-level information from the topics, and temporal context from caption generation, all of which are embedded together in an ordered embedding space. A pairwise ranking objective is used to train this embedding space, which allows similar images, topics, and captions in the shared semantic space to maintain a partial order in the visual-semantic hierarchy and hence helps the model to produce more visually accurate…
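The abstract names a pairwise ranking objective over an ordered embedding space but does not give its formula. The sketch below illustrates one common formulation of that idea, not the authors' implementation. It assumes non-negative embeddings, a coordinate-wise (reversed) order in which the more specific item dominates the more general one, and that an image is treated as more specific than its caption. The function names, the margin value, and the toy vectors are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def order_violation(specific: Vector, general: Vector) -> float:
    """Squared penalty ||max(0, general - specific)||^2.

    It is zero exactly when specific[i] >= general[i] in every coordinate,
    i.e. when the pair respects the assumed partial order.
    """
    return sum(max(0.0, g - s) ** 2 for s, g in zip(specific, general))


def pairwise_ranking_loss(
    captions: List[Vector], images: List[Vector], margin: float = 0.05
) -> float:
    """Hinge ranking loss for a batch in which captions[k] matches images[k].

    Each matching pair should violate the order less (by at least `margin`)
    than any mismatched pair built from the same batch.
    """
    loss = 0.0
    for k, (cap, img) in enumerate(zip(captions, images)):
        positive = order_violation(img, cap)
        for j in range(len(images)):
            if j == k:
                continue
            # Contrastive image paired with the true caption.
            loss += max(0.0, margin + positive - order_violation(images[j], cap))
            # Contrastive caption paired with the true image.
            loss += max(0.0, margin + positive - order_violation(img, captions[j]))
    return loss


if __name__ == "__main__":
    # Toy embeddings (assumed values): each image dominates its own caption.
    toy_captions = [[1.0, 0.0], [0.0, 1.0]]
    toy_images = [[2.0, 0.5], [0.5, 2.0]]
    print(pairwise_ranking_loss(toy_captions, toy_images))
    # Swapping the images breaks the order, so the loss becomes positive.
    print(pairwise_ranking_loss(toy_captions, toy_images[::-1]))
```

The abstract also places topics in the same space. Under the same assumptions, a topic embedding could be ranked against images or captions with the same penalty, though how the paper actually combines the three is not stated here.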

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization