Generating Descriptions for Sequential Images with Local-Object Attention and Global Semantic Context Modelling
Jing Su, Chenghua Lin, Mian Zhou, Qingyun Dai, Haoyu Lv

TL;DR
This paper introduces an end-to-end CNN-LSTM model with local-object attention and global semantic context modeling for generating coherent descriptions of sequential images. The model outperforms the baseline on three evaluation metrics on datasets published by Microsoft.
Contribution
It presents a novel CNN-LSTM architecture incorporating local-object attention and global context modeling for sequential image captioning.
Findings
Outperforms the baseline across three evaluation metrics.
Effective local-object attention improves description relevance.
Global semantic context enhances coherence in sequence descriptions.
Abstract
In this paper, we propose an end-to-end CNN-LSTM model for generating descriptions for sequential images with a local-object attention mechanism. To generate coherent descriptions, we capture global semantic context using a multi-layer perceptron, which learns the dependencies between sequential images. A paralleled LSTM network is exploited for decoding the sequence descriptions. Experimental results show that our model outperforms the baseline across three different evaluation metrics on the datasets published by Microsoft.
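The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the attention scoring function, layer sizes, or weights, so the dot-product scoring, the single hidden layer with tanh, the random weights, and all names below (`attend`, `mlp_context`, `random_matrix`) are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(hidden, regions):
    """Soft attention over local-object features.

    Assumption: each region is scored by its dot product with the decoder state.
    """
    weights = softmax([dot(hidden, r) for r in regions])
    dim = len(regions[0])
    attended = [sum(w * r[i] for w, r in zip(weights, regions)) for i in range(dim)]
    return attended, weights


def mlp_context(image_feats, w1, w2):
    """Global context: an MLP over the concatenated per-image features.

    Assumption: one hidden layer, tanh activations, no bias terms.
    """
    x = [v for feat in image_feats for v in feat]
    hidden = [math.tanh(dot(row, x)) for row in w1]
    return [math.tanh(dot(row, hidden)) for row in w2]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
num_images, feat_dim, ctx_dim = 5, 4, 3  # hypothetical sizes

# One feature vector per image in the sequence (stand-in for CNN output).
image_feats = [[rng.random() for _ in range(feat_dim)] for _ in range(num_images)]
w1 = random_matrix(8, num_images * feat_dim, rng)
w2 = random_matrix(ctx_dim, 8, rng)
context = mlp_context(image_feats, w1, w2)

# Six local-object features for one image, attended with a decoder state.
regions = [[rng.random() for _ in range(feat_dim)] for _ in range(6)]
hidden = [rng.random() for _ in range(feat_dim)]
attended, weights = attend(hidden, regions)
```

In a full model, the `context` vector would be shared by the LSTM decoders for every image in the sequence, while `attended` would be recomputed at each decoding step. How the paper combines the two is not stated in the abstract.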
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory
