Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation
Wei Wei, Ling Cheng, Xianling Mao, Guangyou Zhou, and Feida Zhu

TL;DR
This paper introduces Stack-VS, a multi-stage architecture combining top-down and bottom-up attention models with a novel stack decoder for generating detailed image captions, significantly improving performance on MSCOCO.
Contribution
The paper proposes a new multi-stage Stack-VS architecture with a stack decoder that effectively integrates visual and semantic information for fine-grained image captioning.
Findings
Significant improvements in BLEU-4, CIDEr, and SPICE scores on MSCOCO.
Effective combination of top-down and bottom-up attention models.
Enhanced decoder structure with interactive LSTM layers.
Abstract
Recently, automatic image caption generation has been an important focus of work on the multimodal translation task. Existing approaches can be roughly categorized into two classes, top-down and bottom-up: the former transfers the image information (called the visual-level feature) directly into a caption, and the latter uses extracted words (called semantic-level attributes) to generate a description. However, previous methods are typically either based on a one-stage decoder or use only part of the visual-level or semantic-level information for image caption generation. In this paper, we address the problem and propose an innovative multi-stage architecture (called Stack-VS) for rich fine-grained image caption generation, combining bottom-up and top-down attention models to effectively handle both visual-level and semantic-level information of an input image.…
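To make the idea of attending to both visual-level and semantic-level information concrete, here is a minimal sketch in plain Python. It is not the paper's stack decoder: the abstract does not specify the attention form, so this uses generic dot-product soft attention, and every name and value (`attend`, `hidden`, `visual_regions`, `semantic_attributes`) is a hypothetical chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(
    query: Sequence[float], items: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Generic dot-product soft attention (an assumption, not the paper's form).

    Returns the attention weights over `items` and their weighted average.
    """
    weights = softmax([dot(query, item) for item in items])
    dim = len(items[0])
    context = [
        sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)
    ]
    return weights, context


# Hypothetical toy inputs: a decoder hidden state, three image-region
# features (visual level), and two attribute embeddings (semantic level).
hidden = [1.0, 0.0]
visual_regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
semantic_attributes = [[0.0, 1.0], [1.0, 1.0]]

visual_weights, visual_context = attend(hidden, visual_regions)
semantic_weights, semantic_context = attend(hidden, semantic_attributes)

# Concatenate the two context vectors as the input to a next decoding step.
fused = visual_context + semantic_context
```

In this sketch the same hidden state queries both feature sets and the two context vectors are simply concatenated. A multi-stage decoder such as the one the abstract describes would presumably refine this over several stages, but those details are not given here.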
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
