Automated Image Captioning with CNNs and Transformers

Joshua Adrian Cahyono; Jeremy Nathan Jusuf

arXiv:2412.10511 · cs.CV · December 17, 2024

TL;DR

This paper presents an automated image captioning system that combines CNNs and Transformers, using attention mechanisms and hyperparameter tuning to improve the quality of generated descriptions, as measured by standard metrics such as BLEU, METEOR, and CIDEr.

Contribution

It introduces a hybrid CNN-Transformer architecture with attention mechanisms and hyperparameter optimization for enhanced image captioning performance.
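The attention step at the core of a CNN-Transformer captioner can be sketched in a few lines. The example below is a minimal pure-Python illustration of scaled dot-product attention, assuming that queries come from caption token states and that keys and values come from CNN feature-map regions. The function names and the toy vectors are hypothetical and are not taken from the authors' repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for lists of plain Python vectors.

    queries: one vector per caption token (assumed decoder states)
    keys, values: one vector per image region (assumed CNN features)
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


if __name__ == "__main__":
    # Toy example: one token query attending over two image regions.
    outputs, weights = scaled_dot_product_attention(
        queries=[[1.0, 0.0]],
        keys=[[1.0, 0.0], [0.0, 1.0]],
        values=[[10.0, 0.0], [0.0, 10.0]],
    )
    print(weights[0], outputs[0])
```

In this toy case the query is aligned with the first key, so the first region receives the larger weight. A practical model would use batched tensor operations and multiple heads rather than Python lists.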

Findings

1. Transformer-based models outperform CNN-RNN in caption quality

2. Attention mechanisms significantly improve description relevance

3. Optimized hyperparameters lead to higher evaluation scores

Abstract

This project aims to create an automated image captioning system that generates natural language descriptions for input images by integrating techniques from computer vision and natural language processing. We employ a range of techniques, from CNN-RNN models to more advanced transformer-based approaches. Training is carried out on image datasets paired with descriptive captions, and model performance is evaluated using established metrics such as BLEU, METEOR, and CIDEr. The project also involves experimentation with advanced attention mechanisms, comparisons of different architectural choices, and hyperparameter optimization to refine captioning accuracy and overall system effectiveness.
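To make the evaluation step more concrete, here is a minimal sketch of sentence-level BLEU, one of the metrics named in the abstract. It assumes whitespace-tokenized captions, uniform n-gram weights, and no smoothing. The function names are hypothetical, and this is not the evaluation code used by the authors, who may rely on a standard toolkit instead.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform n-gram weights."""
    precisions = [
        modified_precision(candidate, references, n) for n in range(1, max_n + 1)
    ]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty uses the reference length closest to the candidate's.
    ref_len = min(
        (len(r) for r in references),
        key=lambda length: (abs(length - len(candidate)), length),
    )
    if len(candidate) > ref_len:
        penalty = 1.0
    else:
        penalty = math.exp(1 - ref_len / len(candidate))
    return penalty * math.exp(log_mean)


if __name__ == "__main__":
    reference = "a dog runs on the grass".split()
    print(sentence_bleu("a dog runs on the grass".split(), [reference]))
    print(sentence_bleu("a dog runs on grass".split(), [reference]))
```

Clipping is what stops a degenerate caption from scoring well: for the candidate "the the the the the the the" against the reference "the cat is on the mat", the unigram precision is 2/7 rather than 7/7. Corpus-level BLEU, METEOR, and CIDEr involve further steps (aggregation across captions, synonym matching, TF-IDF weighting) that this sketch does not cover.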

Code & Models

Repositories

JeremyNathanJusuf/image-captioning (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need