Stacked Cross-modal Feature Consolidation Attention Networks for Image Captioning
Mozhgan Pourkeshavarz, Shahabedin Nabavi, Mohsen Ebrahimi Moghaddam, Mehrnoush Shamsfard

TL;DR
This paper introduces a novel stacked cross-modal feature consolidation attention network for image captioning, integrating high-level semantics and visual cues through multi-step reasoning to generate more detailed captions.
Contribution
It proposes a new SCFC attention network with a compounding function and context-aware attributes, advancing end-to-end image captioning methods.
Findings
Outperforms state-of-the-art on MSCOCO and Flickr30K datasets
Effectively combines semantic and visual information for detailed captions
Uses a novel multi-step reasoning approach for feature consolidation
Abstract
Recently, the attention-enriched encoder-decoder framework has aroused great interest in image captioning due to its overwhelming progress. Many visual attention models directly leverage meaningful regions to generate image descriptions. However, seeking a direct transition from visual space to text is not enough to generate fine-grained captions. This paper exploits a feature-compounding approach to bring together high-level semantic concepts and visual information regarding the contextual environment fully end-to-end. Thus, we propose a stacked cross-modal feature consolidation (SCFC) attention network for image captioning in which we simultaneously consolidate cross-modal features through a novel compounding function in a multi-step reasoning fashion. Besides, we jointly employ spatial information and context-aware attributes (CAA) as the principal components in our proposed…
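The abstract describes consolidating visual and semantic (attribute) features over several reasoning steps, but this summary does not give the compounding function itself. The sketch below is only an illustration of the general idea, under stated assumptions: `attend`, `compound`, and `stacked_consolidation` are hypothetical names, the sigmoid-gated mix stands in for the paper's compounding function, and the fixed gate logits stand in for parameters that would be learned in a real model.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, features):
    """Scaled dot-product attention over the rows of `features`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, row)) / scale for row in features]
    weights = softmax(scores)
    context = [sum(w * row[j] for w, row in zip(weights, features))
               for j in range(len(query))]
    return context, weights

def compound(visual_ctx, attr_ctx, gate_logits):
    """Hypothetical compounding function (an assumption, not the paper's):
    a per-dimension sigmoid gate mixing visual and attribute context."""
    fused = []
    for v, a, g in zip(visual_ctx, attr_ctx, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))
        fused.append(gate * v + (1.0 - gate) * a)
    return fused

def stacked_consolidation(query, visual, attrs, gate_logits_per_step):
    """Multi-step reasoning: each step attends to both modalities with the
    current query, fuses the two contexts, and refines the query residually."""
    for gate_logits in gate_logits_per_step:
        v_ctx, _ = attend(query, visual)
        a_ctx, _ = attend(query, attrs)
        fused = compound(v_ctx, a_ctx, gate_logits)
        query = [q + f for q, f in zip(query, fused)]
    return query

# Toy data: 5 image regions and 3 attribute embeddings, all of dimension 4.
random.seed(0)
d = 4
visual = [[random.gauss(0, 1) for _ in range(d)] for _ in range(5)]
attrs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]
query = [random.gauss(0, 1) for _ in range(d)]
gates = [[0.0] * d, [0.0] * d]  # two reasoning steps, gates fixed at 0.5
refined = stacked_consolidation(query, visual, attrs, gates)
```

In a captioning decoder, a vector like `refined` would typically condition the prediction of the next word; how the paper wires this into its decoder is not stated in this summary.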
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
