Explainable Image Captioning using CNN-CNN architecture and Hierarchical Attention
Rishi Kesav Mohan, Sanjay Sureshkumar, Vignesh Sivasubramaniam

TL;DR
This paper introduces an explainable image captioning model that uses a CNN-CNN architecture with hierarchical attention to improve transparency, speed, and accuracy in caption generation. The model is validated on the MSCOCO dataset.
Contribution
It proposes a novel CNN-based architecture with hierarchical attention for explainable image captioning, improving interpretability, speed, and accuracy over traditional black-box models.
Findings
Improved captioning accuracy demonstrated on the MSCOCO dataset
Enhanced model interpretability through visualization of explanations
Faster caption generation compared to existing methods
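The interpretability finding rests on attention weights that can be inspected or plotted over image regions. The following is a minimal sketch of that general idea, not the paper's method: it shows plain single-level dot-product attention with a softmax, whereas the paper describes hierarchical attention whose details are not given here. The names `softmax`, `attend`, `query`, and `regions`, and all numeric values, are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, region_features):
    """Single-level dot-product attention over image regions (illustrative).

    Returns the per-region weights (the part one could visualize as an
    explanation) and the weighted average of the region features.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_features))
        for d in range(dim)
    ]
    return weights, context


# Hypothetical toy inputs: three image regions with 2-dimensional features,
# and a query vector standing in for the decoder state at one word.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]

weights, context = attend(query, regions)
for i, w in enumerate(weights):
    print(f"region {i}: weight {w:.3f}")
```

In this toy setup the regions that align with the query receive the larger weights, which is the kind of per-region score that can be overlaid on an image to show what the model attended to when producing a word.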
Abstract
Image captioning produces text descriptions of an image. Deep learning solutions built on feature recognition can serve this purpose well. But as with many machine learning solutions, the user has little insight into how a caption is generated, and the model offers no explanation for its predictions; such conventional methods are therefore called black-box methods. An approach whose predictions the user can trust is needed to support interpretability. Explainable AI reworks a conventional method so that the predictions of the model or algorithm can be explained and justified. This article applies Explainable AI to image captioning so that the captions generated by the model can be explained and visualized. A newer…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Video Analysis and Summarization
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
