Hyperparameter Analysis for Image Captioning
Amish Patel, Aravind Varier

TL;DR
This paper runs a hyperparameter sensitivity analysis of CNN+LSTM and CNN+Transformer image captioning models on Flickr8k. It finds that fine-tuning the CNN encoder beats the baseline and every other configuration tested, for both architectures.
Contribution
A sensitivity analysis of hyperparameter effects across two image captioning architectures, which identifies CNN encoder fine-tuning as the most effective change among the experiments carried out.
Findings
Fine-tuning the CNN encoder outperforms the baseline and all other experiments carried out.
Both architectures, CNN+LSTM and CNN+Transformer, respond the same way to encoder fine-tuning: it gives the best result in each.
Abstract
In this paper, we perform a thorough sensitivity analysis on state-of-the-art image captioning approaches using two different architectures: CNN+LSTM and CNN+Transformer. Experiments were carried out using the Flickr8k dataset. The biggest takeaway from the experiments is that fine-tuning the CNN encoder outperforms the baseline and all other experiments carried out for both architectures.
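As a rough illustration of what a sensitivity analysis against a baseline can look like, the sketch below enumerates one-factor-at-a-time configurations for each architecture. It is not the authors' setup: the abstract names only CNN encoder fine-tuning, so the other hyperparameters (`learning_rate`, `embedding_dim`), all values, and the helper names `one_at_a_time` and `changed_keys` are assumptions made for the example.

```python
from typing import Any, Dict, List

# Hypothetical baseline configuration. Only "finetune_encoder" corresponds to
# something named in the abstract; the other keys and all values are assumed.
BASELINE: Dict[str, Any] = {
    "finetune_encoder": False,
    "learning_rate": 1e-3,
    "embedding_dim": 256,
}

# Hypothetical alternative values to try for each hyperparameter.
ALTERNATIVES: Dict[str, List[Any]] = {
    "finetune_encoder": [True],
    "learning_rate": [1e-4, 1e-2],
    "embedding_dim": [128, 512],
}


def one_at_a_time(baseline: Dict[str, Any],
                  alternatives: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return the baseline plus one config per alternative value,
    each differing from the baseline in exactly one hyperparameter."""
    configs = [dict(baseline)]
    for name, values in alternatives.items():
        for value in values:
            if value == baseline[name]:
                continue  # skip values identical to the baseline
            config = dict(baseline)
            config[name] = value
            configs.append(config)
    return configs


def changed_keys(baseline: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """List the hyperparameters whose value differs from the baseline."""
    return [key for key in baseline if config[key] != baseline[key]]


if __name__ == "__main__":
    # The two architectures named in the abstract.
    for architecture in ("CNN+LSTM", "CNN+Transformer"):
        for config in one_at_a_time(BASELINE, ALTERNATIVES):
            label = changed_keys(BASELINE, config) or ["baseline"]
            print(architecture, label, config)
```

Each generated configuration would then be trained and scored with the same captioning metric, so that any difference from the baseline can be attributed to the single hyperparameter that changed.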
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
