A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation
Rashid Khan, Bingding Huang, Haseeb Hassan, Asim Zaman, Zhongfu Ye

TL;DR
This paper compares several pre-trained CNN encoders paired with a GRU-based attention decoder for image captioning, presenting a deep neural framework that combines vision and language models to generate descriptive image captions.
Contribution
The paper introduces a deep neural framework that uses multiple pre-trained CNNs as the encoder and a GRU language model with Bahdanau attention as the decoder for image caption generation.
Findings
Achieves scores competitive with state-of-the-art methods on the MSCOCO and Flickr30k datasets.
Effectively combines CNN features with GRU-based attention for captioning.
Bridges the gap between computer vision and natural language processing.
Abstract
Image captioning is a challenging task involving generating a textual description for an image using computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder to enable learning to focus on specific image parts. We evaluate our approach using the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. Our proposed framework can bridge the gap between computer vision and natural language and can be extended to specific domains.
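The decoder described in the abstract can be illustrated with a minimal NumPy sketch of one decoding step: Bahdanau (additive) attention over image-region features, followed by a GRU update. This is not the authors' implementation. The parameter names (W1, W2, v, Wo), the dimensions, and the choice to feed the concatenated context vector and word embedding into the GRU are assumptions made for illustration, and random arrays stand in for features from a pre-trained CNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention: score_i = v . tanh(f_i W1 + h W2)."""
    # features: (num_regions, feat_dim); hidden: (hidden_dim,)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    scores = scores - scores.max()                      # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()     # softmax over regions
    context = weights @ features                        # (feat_dim,)
    return context, weights

def gru_step(x, h, p):
    """One GRU update; uses the h' = (1 - z) * h + z * h_tilde convention."""
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])    # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])    # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def decode_step(features, h, word_emb, params):
    """Attend to the image, update the GRU state, and score the vocabulary."""
    context, weights = bahdanau_attention(
        features, h, params["W1"], params["W2"], params["v"])
    x = np.concatenate([context, word_emb])             # assumed input layout
    h_new = gru_step(x, h, params["gru"])
    logits = h_new @ params["Wo"] + params["bo"]        # (vocab_size,)
    return logits, h_new, weights

def init_params(rng, feat_dim, hidden_dim, attn_dim, embed_dim, vocab_size):
    """Random stand-in weights; a trained model would learn these."""
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    in_dim = feat_dim + embed_dim
    gru = {}
    for g in ("z", "r", "h"):
        gru["W" + g] = w(in_dim, hidden_dim)
        gru["U" + g] = w(hidden_dim, hidden_dim)
        gru["b" + g] = np.zeros(hidden_dim)
    return {
        "W1": w(feat_dim, attn_dim), "W2": w(hidden_dim, attn_dim),
        "v": w(attn_dim), "gru": gru,
        "Wo": w(hidden_dim, vocab_size), "bo": np.zeros(vocab_size),
    }
```

In a full captioning loop, the word with the highest logit (or a sampled word) would be embedded and passed to the next `decode_step` call together with `h_new`, while the attention weights indicate which image regions the decoder focused on for that word.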
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Focus · Gated Recurrent Unit
