A Self-Explainable Stylish Image Captioning Framework via Multi-References
Chengxi Li, Brent Harrison

TL;DR
This paper introduces 2M, a stylish image captioning framework that generates stylized captions and explains likely errors by using the model's multiple generated references to identify erroneous input features.
Contribution
The paper presents a novel Multi-style Multi modality mechanism (2M) for stylish image captioning and explanation generation, enhancing both caption quality and interpretability.
Findings
2M effectively generates stylish captions.
Multi-references support error explanation.
The model makes its captioning errors more interpretable.
Abstract
In this paper, we propose to build a stylish image captioning model through a Multi-style Multi modality mechanism (2M). We demonstrate that with 2M, we can build an effective stylish captioner and that multi-references produced by the model can also support explaining the model through identifying erroneous input features on faulty examples. We show how this 2M mechanism can be used to build stylish captioning models and show how these models can be utilized to provide explanations of likely errors in the models.
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
