Image Captioning with Integrated Bottom-Up and Multi-level Residual Top-Down Attention for Game Scene Understanding
Jian Zheng, Sudha Krishnamurthy, Ruxin Chen, Min-Hung Chen, Zhenhao Ge, Xiaohua Li

TL;DR
This paper introduces a novel game image captioning model that combines bottom-up attention with multi-level residual top-down attention, effectively capturing spatial details for improved captioning in game scenes.
Contribution
The work proposes a new multi-level residual top-down attention mechanism, integrated with bottom-up attention, for game image captioning. It targets the spatial information that the bottom-up network may lose when extracting regional features.
Findings
Model outperforms baseline models on game datasets
Enhanced spatial feature retention improves caption quality
Effective fusion of regional features for game scene understanding
Abstract
Image captioning has attracted considerable attention in recent years. However, little work has been done on game image captioning, which has some unique characteristics and requirements. In this work we propose a novel game image captioning model which integrates bottom-up attention with a new multi-level residual top-down attention mechanism. First, a lower-level residual top-down attention network is added to the Faster R-CNN based bottom-up attention network to address the problem that the latter may lose important spatial information when extracting regional features. Second, an upper-level residual top-down attention network is implemented in the caption generation network to better fuse the extracted regional features for subsequent caption prediction. We create two game datasets to evaluate the proposed model. Extensive experiments show that our proposed model outperforms…
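To make the idea of fusing regional features with residual top-down attention more concrete, here is a minimal sketch. It is not the authors' implementation: the abstract does not give the scoring function or the form of the residual path, so the scaled dot-product scores, the mean-pooled skip connection, and the names `residual_top_down_attention`, `regions`, and `query` are all assumptions chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_top_down_attention(regions, query):
    """Fuse k regional feature vectors under a top-down query.

    regions: list of k feature vectors, each of length d (for example,
             per-region features from a bottom-up detector).
    query:   length-d vector standing in for the decoder state.

    Assumptions (hypothetical, not from the paper): scores are scaled
    dot products, and the residual path adds the mean-pooled region
    features to the attended vector.
    """
    d = len(query)
    scores = [
        sum(r_j * q_j for r_j, q_j in zip(region, query)) / math.sqrt(d)
        for region in regions
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the region features.
    attended = [
        sum(w * region[j] for w, region in zip(weights, regions))
        for j in range(d)
    ]
    # Residual (skip) path: mean of all region features.
    mean_feat = [sum(region[j] for region in regions) / len(regions) for j in range(d)]
    fused = [a + m for a, m in zip(attended, mean_feat)]
    return fused, weights


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    fused, weights = residual_top_down_attention(regions, [10.0, 0.0])
    print(weights, fused)
```

With a query aligned to the first region, that region receives the larger weight, while the residual term keeps a contribution from every region even when the attention is sharply peaked.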
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Region Proposal Network · Softmax · Convolution · RoIPool · Faster R-CNN
