Image Captioning with Integrated Bottom-Up and Multi-level Residual Top-Down Attention for Game Scene Understanding

Jian Zheng; Sudha Krishnamurthy; Ruxin Chen; Min-Hung Chen; Zhenhao Ge; Xiaohua Li

arXiv:1906.06632·cs.CV·June 18, 2019·1 cites

TL;DR

This paper introduces a novel game image captioning model that combines bottom-up attention with multi-level residual top-down attention, effectively capturing spatial details for improved captioning in game scenes.

Contribution

The work proposes a new multi-level residual top-down attention mechanism integrated with bottom-up attention for game image captioning. It targets the spatial information that the bottom-up network may lose when extracting regional features.

Findings

01. Model outperforms baseline models on game datasets

02. Enhanced spatial feature retention improves caption quality

03. Effective fusion of regional features for game scene understanding

Abstract

Image captioning has attracted considerable attention in recent years. However, little work has been done on game image captioning, which has unique characteristics and requirements. In this work, we propose a novel game image captioning model that integrates bottom-up attention with a new multi-level residual top-down attention mechanism. First, a lower-level residual top-down attention network is added to the Faster R-CNN based bottom-up attention network to address the problem that the latter may lose important spatial information when extracting regional features. Second, an upper-level residual top-down attention network is implemented in the caption generation network to better fuse the extracted regional features for subsequent caption prediction. We create two game datasets to evaluate the proposed model. Extensive experiments show that our proposed model outperforms…
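
The abstract does not give the attention equations, so the following is only a minimal sketch of the general idea of top-down attention over regional features with a residual connection. Everything here is an assumption made for illustration: the function name `residual_top_down_attention`, the dot-product scoring, and the choice of adding the mean-pooled regional feature back as the residual path. It should not be read as the authors' formulation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_top_down_attention(regions, query):
    """Hypothetical sketch, not the paper's exact mechanism.

    regions: list of k regional feature vectors, each of length d
             (e.g. features pooled from detector proposals).
    query:   length-d vector standing in for the top-down signal
             (e.g. a decoder hidden state).
    Returns the fused feature vector and the attention weights.
    """
    d = len(query)
    # Assumption: score each region by its dot product with the query.
    scores = [sum(r_j * q_j for r_j, q_j in zip(r, query)) for r in regions]
    weights = softmax(scores)
    # Attention-weighted sum of the regional features.
    attended = [sum(w * r[j] for w, r in zip(weights, regions)) for j in range(d)]
    # Assumption: the residual path re-adds the mean-pooled regional feature,
    # so information from weakly attended regions is not discarded entirely.
    mean_pooled = [sum(r[j] for r in regions) / len(regions) for j in range(d)]
    fused = [a + m for a, m in zip(attended, mean_pooled)]
    return fused, weights
```

With two orthogonal regions `[[1.0, 0.0], [0.0, 1.0]]` and a query aligned with the first, nearly all of the attention weight goes to the first region, while the residual term still contributes half of each region to the fused vector.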

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Region Proposal Network · Softmax · Convolution · RoIPool · Faster R-CNN