Explainable Image Captioning using CNN-CNN architecture and Hierarchical Attention

Rishi Kesav Mohan; Sanjay Sureshkumar; Vignesh Sivasubramaniam

arXiv:2407.09556·cs.CV·July 16, 2024

Open Access

TL;DR

This paper introduces an explainable image captioning model built on a CNN-CNN architecture with hierarchical attention. The model aims to improve the transparency, speed, and accuracy of caption generation and is validated on the MSCOCO dataset.

Contribution

It proposes a novel CNN-based architecture with hierarchical attention for explainable image captioning, improving interpretability, speed, and accuracy over traditional black-box models.
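
To make the link between attention and explainability concrete, the following is a minimal sketch of a single attention step from caption positions to image regions, written with NumPy. It is not the authors' architecture: this page does not specify the hierarchical scheme, so the scaled dot-product scoring, the function names `softmax` and `attend`, the 7x7 region grid, and all dimensions are assumptions chosen for illustration, and the features are random stand-ins for CNN outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def attend(word_states, region_features):
    """Scaled dot-product attention from caption positions to image regions.

    word_states:     (T, d) one hidden vector per caption position
    region_features: (R, d) one feature vector per image region
    Returns (context, weights) with shapes (T, d) and (T, R).
    """
    d = word_states.shape[-1]
    scores = word_states @ region_features.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)      # each row sums to 1 over regions
    context = weights @ region_features     # weighted mix of region features
    return context, weights

# Hypothetical inputs: random stand-ins for CNN features (assumption).
rng = np.random.default_rng(0)
regions = rng.normal(size=(49, 64))   # 7x7 grid of image regions, 64-dim each
words = rng.normal(size=(5, 64))      # 5 caption positions, 64-dim each

context, weights = attend(words, regions)

# Reshape each word's weights back onto the grid: one heatmap per word,
# which could be upsampled and overlaid on the image as an explanation.
heatmaps = weights.reshape(5, 7, 7)
```

Each row of `weights` is a probability distribution over image regions, so a per-word heatmap shows which regions contributed most to that word. A hierarchical variant would combine several such maps from different feature levels; how the paper does this is not stated here.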

Findings

01 Improved captioning accuracy demonstrated on the MSCOCO dataset

02 Enhanced model interpretability through visualization of explanations

03 Faster caption generation compared to existing methods

Abstract

Image captioning is a technology that produces text-based descriptions of an image. Deep learning-based solutions built on top of feature recognition may well serve this purpose. But, as with many machine learning solutions, the user has little insight into how a caption is generated, and the model offers no explanation for its predictions; conventional methods are therefore referred to as black-box methods. An approach in which the user can trust the model's predictions is needed, which calls for interpretability. Explainable AI reworks a conventional method so that the predictions of the model or algorithm can be explained and justified. This article therefore approaches image captioning with Explainable AI, so that the captions generated by the model can be explained and visualized. A newer…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings