Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning

Mozhgan Pourkeshavarz; Shahabedin Nabavi; Mohsen Ebrahimi Moghaddam; Mehrnoush Shamsfard

arXiv:2302.04676·cs.CV·April 10, 2024

TL;DR

This paper introduces a novel stacked cross-modal feature consolidation attention network for image captioning, integrating high-level semantics and visual cues through multi-step reasoning to generate more detailed captions.

Contribution

It proposes a new SCFC attention network with a compounding function and context-aware attributes, advancing end-to-end image captioning methods.

Findings

01 Outperforms prior state-of-the-art methods on the MSCOCO and Flickr30K datasets

02 Effectively combines semantic and visual information for detailed captions

03 Uses a novel multi-step reasoning approach for feature consolidation

Abstract

Recently, the attention-enriched encoder-decoder framework has attracted great interest in image captioning because of its remarkable progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach that brings together high-level semantic concepts and visual information about the contextual environment in a fully end-to-end manner. To this end, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning, in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. In addition, we jointly employ spatial information and context-aware attributes (CAA) as the principal components in our proposed…
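The multi-step consolidation idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' SCFC implementation: the abstract does not define the compounding function, so the `compound` function below is a placeholder (an element-wise mean), and the dot-product attention, the residual query update, and all function and parameter names are assumptions chosen only for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product attention: return the weights and the weighted sum of features."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(query)
    context = [
        sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)
    ]
    return weights, context


def compound(visual_ctx, semantic_ctx):
    """ASSUMPTION: placeholder compounding function (element-wise mean).

    The paper's actual compounding function is not given in this text.
    """
    return [(v + s) / 2.0 for v, s in zip(visual_ctx, semantic_ctx)]


def stacked_consolidation(query, regions, attributes, num_steps=2):
    """Refine a query vector over several reasoning steps (hypothetical helper)."""
    for _ in range(num_steps):
        # Attend separately to each modality with the current query.
        _, visual_ctx = attend(query, regions)
        _, semantic_ctx = attend(query, attributes)
        # Consolidate the two attended contexts into one vector.
        fused = compound(visual_ctx, semantic_ctx)
        # ASSUMPTION: residual update feeds the fused context into the next step.
        query = [q + f for q, f in zip(query, fused)]
    return query


if __name__ == "__main__":
    rng = random.Random(0)
    dim = 4
    # Random stand-ins for region features and attribute embeddings.
    regions = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(5)]
    attributes = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
    query = [rng.uniform(-1, 1) for _ in range(dim)]
    print(stacked_consolidation(query, regions, attributes, num_steps=3))
```

Each pass through the loop corresponds to one reasoning step: the query that attends to both modalities at step k is the consolidated output of step k-1, which is one simple way to realize "stacked" attention. In a real captioning model the final vector would feed a language decoder, and every component here would have learned parameters.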

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization