TL;DR
This paper investigates how sensitive neural image caption generators are to their visual input. It finds that the visual information they retain and use varies with the type of word being generated and its position in the caption. The work is motivated by the broader goal of explainability in AI.
Contribution
It provides a sensitivity analysis and an omission analysis, both based on image foils, showing that caption generators rely on visual input to different degrees depending on word type and caption position.
Findings
Sensitivity to visual input varies by word type
The visual information that captioning architectures retain varies with position in the caption
The analysis is motivated by the goal of more explainable neural captioning models
Abstract
This paper addresses the sensitivity of neural image caption generators to their visual input. A sensitivity analysis and omission analysis based on image foils is reported, showing that the extent to which image captioning architectures retain and are sensitive to visual information varies depending on the type of word being generated and the position in the caption as a whole. We motivate this work in the context of broader goals in the field to achieve more explainability in AI.
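To make the idea of an omission analysis with image foils concrete, here is a minimal sketch, not the authors' implementation. It swaps the real image for a foil and measures, at each caption position, how much the next-word distribution changes. Everything here is an assumption made for illustration: the toy model `toy_next_word_dist`, the per-position `visual_weight` values, and the choice of Jensen-Shannon divergence as the distance.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions, in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mid = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)


def toy_next_word_dist(image_feats, position, visual_weight):
    """Hypothetical stand-in for a captioner's next-word softmax.

    The vocabulary has one word per image feature, and visual_weight[position]
    controls how much the image influences the scores at that position.
    """
    w = visual_weight[position]
    scores = [w * f + (1 - w) * 0.5 for f in image_feats]
    return softmax(scores)


def omission_scores(image_feats, foil_feats, visual_weight):
    """Per-position change in the output distribution when the image is replaced by a foil."""
    return [
        js_divergence(
            toy_next_word_dist(image_feats, t, visual_weight),
            toy_next_word_dist(foil_feats, t, visual_weight),
        )
        for t in range(len(visual_weight))
    ]


# Assumed toy setup: the image's influence fades along the caption.
scores = omission_scores(
    image_feats=[2.0, 0.0, 0.0],
    foil_feats=[0.0, 0.0, 2.0],
    visual_weight=[1.0, 0.5, 0.0],
)
```

In this toy setup a higher score at a position means the output there depends more on the image. With a real captioner, the same comparison would be run on the model's actual output distributions and grouped by word type as well as by position.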
