A Comparative Study of Pre-trained CNNs and GRU-Based Attention for Image Caption Generation

Rashid Khan; Bingding Huang; Haseeb Hassan; Asim Zaman; Zhongfu Ye

arXiv:2310.07252·cs.CV·October 12, 2023·1 cites

TL;DR

This paper compares pre-trained CNNs and a GRU-based attention mechanism for image captioning, demonstrating a deep neural framework that effectively combines vision and language models to generate descriptive image captions.

Contribution

It introduces a novel deep neural framework integrating multiple pre-trained CNNs with a GRU-based attention mechanism for improved image caption generation.

Findings

01. Achieves scores competitive with state-of-the-art methods on the MSCOCO and Flickr30k datasets.

02. Combines features from pre-trained CNNs with a GRU decoder that uses Bahdanau attention to generate captions.

03. Bridges the gap between computer vision and natural language processing.

Abstract

Image captioning is a challenging task involving generating a textual description for an image using computer vision and natural language processing techniques. This paper proposes a deep neural framework for image caption generation using a GRU-based attention mechanism. Our approach employs multiple pre-trained convolutional neural networks as the encoder to extract features from the image and a GRU-based language model as the decoder to generate descriptive sentences. To improve performance, we integrate the Bahdanau attention model with the GRU decoder to enable learning to focus on specific image parts. We evaluate our approach using the MSCOCO and Flickr30k datasets and show that it achieves competitive scores compared to state-of-the-art methods. Our proposed framework can bridge the gap between computer vision and natural language and can be extended to specific domains.
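The encoder-decoder arrangement described in the abstract can be sketched as a single decoding step: Bahdanau (additive) attention scores each image location against the current decoder state, and the resulting context vector is fed to a GRU cell together with the previous word embedding. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The function names, the dimension sizes, the random weights, and the random array standing in for pre-trained CNN features are all hypothetical, and the GRU update uses one of the two common gate conventions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bahdanau_attention(features, hidden, W1, W2, v):
    """Additive attention over image locations.

    features: (L, D) one feature vector per image location
    hidden:   (H,)   current decoder state
    W1: (D, A), W2: (H, A), v: (A,)  learned projections (random here)
    """
    scores = np.tanh(features @ W1 + hidden @ W2) @ v  # (L,)
    weights = softmax(scores)                          # sums to 1 over locations
    context = weights @ features                       # (D,) weighted feature average
    return context, weights

def gru_step(x, h, p):
    # Standard GRU cell; convention assumed: h_new = (1 - z) * h + z * h_tilde.
    z = sigmoid(x @ p["Wz"] + h @ p["Uz"] + p["bz"])   # update gate
    r = sigmoid(x @ p["Wr"] + h @ p["Ur"] + p["br"])   # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"] + p["bh"])
    return (1.0 - z) * h + z * h_tilde

def decode_step(features, h, word_emb, attn, gru, W_out):
    # One caption step: attend, concatenate context with word embedding, update GRU.
    context, weights = bahdanau_attention(features, h, *attn)
    x = np.concatenate([context, word_emb])
    h_new = gru_step(x, h, gru)
    logits = h_new @ W_out                             # unnormalized vocabulary scores
    return logits, h_new, weights

# Hypothetical sizes: 49 locations (e.g. a 7x7 grid), 64-dim features,
# 32-dim hidden state, 16-dim attention space, 8-dim embeddings, 20-word vocabulary.
L, D, H, A, E, V = 49, 64, 32, 16, 8, 20
rng = np.random.default_rng(0)
init = lambda *shape: rng.normal(scale=0.1, size=shape)

features = init(L, D)          # stand-in for pre-trained CNN encoder output
attn = (init(D, A), init(H, A), init(A))
gru = {"Wz": init(D + E, H), "Uz": init(H, H), "bz": np.zeros(H),
       "Wr": init(D + E, H), "Ur": init(H, H), "br": np.zeros(H),
       "Wh": init(D + E, H), "Uh": init(H, H), "bh": np.zeros(H)}
W_out = init(H, V)

logits, h_new, weights = decode_step(features, np.zeros(H), init(E), attn, gru, W_out)
print(weights.shape, h_new.shape, logits.shape)
```

In a full captioner this step would run in a loop, feeding the embedding of each predicted word back in until an end-of-sentence token is produced; the attention weights can be reshaped to the feature grid to visualize which image regions the decoder attended to.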

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques

Methods: Focus · Gated Recurrent Unit