Automated Image Captioning with CNNs and Transformers
Joshua Adrian Cahyono, Jeremy Nathan Jusuf

TL;DR
This paper presents an automated image captioning system that combines CNNs and Transformers, using attention mechanisms and hyperparameter tuning to improve the quality of generated descriptions as measured by standard metrics such as BLEU, METEOR, and CIDEr.
Contribution
It introduces a hybrid CNN-Transformer architecture with attention mechanisms and hyperparameter optimization for enhanced image captioning performance.
Findings
Transformer-based models outperform CNN-RNN in caption quality
Attention mechanisms significantly improve description relevance
Optimized hyperparameters lead to higher evaluation scores
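As an illustration of the attention mechanism these findings refer to, the following is a minimal sketch of scaled dot-product attention, the standard building block of Transformer models. It is not the authors' implementation: the function names, the plain-list representation, and the toy inputs are assumptions made for this example. Each query vector is compared against all key vectors, the scaled scores are normalized with a softmax, and the resulting weights form a weighted average of the value vectors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of equal-length vectors.

    Hypothetical helper for illustration only: queries and keys share a
    dimension d_k, and there is one value vector per key.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors, one component at a time.
        context = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Toy example: one query (e.g. a decoder state) attending over two
# image-region features with identical keys, so the weights are uniform.
outputs, weights = scaled_dot_product_attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
```

In a captioning model, the queries would typically come from the caption decoder and the keys and values from image features, so the weights indicate which image regions influence each generated word.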
Abstract
This project aims to create an automated image captioning system that generates natural language descriptions for input images by integrating techniques from computer vision and natural language processing. We employ a range of techniques, from CNN-RNN models to more advanced transformer-based architectures. Training is carried out on image datasets paired with descriptive captions, and model performance is evaluated using established metrics such as BLEU, METEOR, and CIDEr. The project also involves experimentation with advanced attention mechanisms, comparisons of different architectural choices, and hyperparameter optimization to refine captioning accuracy and overall system effectiveness.
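To make the evaluation step concrete, here is a simplified sketch of sentence-level BLEU, one of the metrics named above. It is an illustration under stated assumptions, not the evaluation code used in the project: the function names are hypothetical, captions are assumed to be pre-tokenized lists of words, and no smoothing is applied, so any n-gram order with zero matches yields a score of 0. The score combines clipped n-gram precisions (geometric mean over orders 1 to `max_n`) with a brevity penalty for candidates shorter than the closest reference.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU (hypothetical helper for illustration)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        # For each n-gram, the largest count seen in any single reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngram_counts(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip candidate counts so repeated words are not over-rewarded.
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty against the reference length closest to the candidate.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "a dog runs across the grass".split()
perfect = sentence_bleu(reference, [reference])
unrelated = sentence_bleu("two people ride bicycles".split(), [reference])
repeated = sentence_bleu(["the", "the", "the"], [["the", "cat"]], max_n=1)
```

An exact match scores 1, a caption with no shared words scores 0, and the repeated-word case shows clipping at work: only one of the three occurrences of "the" is credited. METEOR and CIDEr follow different formulations and are not covered by this sketch.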
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need
