A Self-Explainable Stylish Image Captioning Framework via Multi-References

Chengxi Li; Brent Harrison

arXiv:2110.10704·cs.CL·November 19, 2021

TL;DR

This paper introduces a stylish image captioning framework, 2M, that generates styled captions and uses the multiple references it produces to explain likely errors by identifying erroneous input features on faulty examples.

Contribution

The paper presents a novel Multi-style Multi modality mechanism (2M) for stylish image captioning and explanation generation, enhancing both caption quality and interpretability.

Findings

01. 2M effectively generates stylish captions.

02. Multi-references support error explanation.

03. Model improves interpretability of captioning errors.

Abstract

In this paper, we propose to build a stylish image captioning model through a Multi-style Multi modality mechanism (2M). We demonstrate that with 2M, we can build an effective stylish captioner and that multi-references produced by the model can also support explaining the model through identifying erroneous input features on faulty examples. We show how this 2M mechanism can be used to build stylish captioning models and show how these models can be utilized to provide explanations of likely errors in the models.

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition