Hyperparameter Analysis for Image Captioning

Amish Patel; Aravind Varier

arXiv:2006.10923 · cs.CV · June 22, 2020 · 1 citation

Open Access

TL;DR

This paper conducts a sensitivity analysis of image captioning models built on CNN+LSTM and CNN+Transformer architectures, using the Flickr8k dataset. Its main result is that fine-tuning the CNN encoder outperforms the baseline and every other experiment carried out, for both architectures.

Contribution

The paper analyzes how hyperparameter choices affect image captioning models and highlights the importance of fine-tuning the CNN encoder.

Findings

01 Fine-tuning CNN encoders outperforms baseline models.

02 CNN+LSTM and CNN+Transformer architectures show similar sensitivity patterns.

03 Fine-tuning consistently improves captioning accuracy.

Abstract

In this paper, we perform a thorough sensitivity analysis on state-of-the-art image captioning approaches using two different architectures: CNN+LSTM and CNN+Transformer. Experiments were carried out using the Flickr8k dataset. The biggest takeaway from the experiments is that fine-tuning the CNN encoder outperforms the baseline and all other experiments carried out for both architectures.
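The abstract does not list the hyperparameters that were varied, so the following is only a minimal sketch of how a one-at-a-time sensitivity sweep of this kind could be organized. Every name and value in it (`BASELINE`, `ALTERNATIVES`, `learning_rate`, `embedding_dim`, and so on) is a hypothetical placeholder and not the authors' configuration. The idea is to start from a baseline, change a single setting per run (for example, switching on encoder fine-tuning), and repeat the sweep for each decoder architecture.

```python
from copy import deepcopy

# Hypothetical baseline configuration. Names and values are illustrative
# assumptions, not settings reported in the paper.
BASELINE = {
    "fine_tune_encoder": False,
    "learning_rate": 1e-3,
    "embedding_dim": 256,
    "decoder": "lstm",
}

# Hypothetical alternative values to try for each hyperparameter.
ALTERNATIVES = {
    "fine_tune_encoder": [True],
    "learning_rate": [1e-4, 1e-2],
    "embedding_dim": [128, 512],
}


def one_at_a_time(baseline, alternatives):
    """Return the baseline plus one config per single-hyperparameter change."""
    runs = [("baseline", deepcopy(baseline))]
    for name, values in alternatives.items():
        for value in values:
            config = deepcopy(baseline)
            config[name] = value  # change exactly one setting
            runs.append((f"{name}={value}", config))
    return runs


def sensitivity_runs(architectures=("lstm", "transformer")):
    """Repeat the one-at-a-time sweep for each decoder architecture."""
    all_runs = []
    for arch in architectures:
        base = dict(BASELINE, decoder=arch)
        for label, config in one_at_a_time(base, ALTERNATIVES):
            all_runs.append((f"{arch}/{label}", config))
    return all_runs


if __name__ == "__main__":
    for label, config in sensitivity_runs():
        print(label, config)
```

Each generated configuration would then be trained and scored with a captioning metric. Comparing every run against the baseline of the same architecture is what lets a single change, such as encoder fine-tuning, be credited with a gain.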

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition