Can Neural Image Captioning be Controlled via Forced Attention?
Philipp Sadler, Tatjana Scheffler, David Schlangen

TL;DR
This paper investigates whether controlling the attention mechanism in neural image captioning models can steer the generated captions towards specific image regions, demonstrating partial success in influencing output content.
Contribution
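The paper introduces three methods for fixing the attention of a standard attention-based image captioning model to pre-determined image areas, and evaluates whether the resulting captions are more likely to mention the class of the object in those areas.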
Findings
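Fixing attention to a pre-determined image area produces the expected result in up to 28.56% of cases, judged by whether the caption mentions the class of the object in that area.
The three proposed methods are effective at fixing attention to specific image regions.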
Results suggest potential for controllable image captioning models.
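The evaluation idea behind these findings can be illustrated with a minimal sketch. It assumes that a case counts as a success when the caption generated under forced attention mentions the target object class while the normally generated caption does not; the paper's exact criterion may differ. The function names `mentions` and `control_success_rate` and the sample captions are hypothetical and used only for illustration.

```python
import re


def mentions(caption: str, class_name: str) -> bool:
    """Return True if class_name appears as a whole word or phrase in caption.

    Assumption: plain case-insensitive matching. Plurals and synonyms
    (e.g. "dogs", "puppy") are NOT counted in this simplified version.
    """
    pattern = r"\b" + re.escape(class_name.lower()) + r"\b"
    return re.search(pattern, caption.lower()) is not None


def control_success_rate(samples):
    """Fraction of samples where forcing attention newly introduces the class.

    Each sample is a tuple (normal_caption, forced_caption, target_class).
    This success criterion is an assumption, not the paper's stated metric.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    hits = 0
    for normal_caption, forced_caption, target_class in samples:
        forced_hit = mentions(forced_caption, target_class)
        normal_hit = mentions(normal_caption, target_class)
        if forced_hit and not normal_hit:
            hits += 1
    return hits / len(samples)


# Hypothetical captions, invented for illustration only.
samples = [
    ("a man sitting on a bench", "a dog sitting on a bench", "dog"),
    ("a dog running on grass", "a dog running on grass", "dog"),
    ("a busy street", "a busy street with a car", "traffic light"),
    ("a busy street", "a traffic light on a busy street", "traffic light"),
]
rate = control_success_rate(samples)
```

Comparing against the normally generated caption matters here: a class that the model already mentions without intervention says nothing about whether the forced attention had any effect.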
Abstract
Learned dynamic weighting of the conditioning signal (attention) has been shown to improve neural language generation in a variety of settings. The weights applied when generating a particular output sequence have also been viewed as providing a potentially explanatory insight into the internal workings of the generator. In this paper, we reverse the direction of this connection and ask whether through the control of the attention of the model we can control its output. Specifically, we take a standard neural image captioning model that uses attention, and fix the attention to pre-determined areas in the image. We evaluate whether the resulting output is more likely to mention the class of the object in that area than the normally generated caption. We introduce three effective methods to control the attention and find that these are producing expected results in up to 28.56% of the…
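To make the idea of fixing attention concrete, here is a minimal NumPy sketch. It assumes a captioning model whose encoder yields one feature vector per cell of a spatial grid and whose decoder consumes an attention-weighted sum of those vectors. Instead of using the learned weights, the sketch puts uniform weight on the grid cells overlapped by a given bounding box. This is one plausible way to force attention and is not claimed to be any of the paper's three methods; the function names, the grid size, and the box format are all assumptions.

```python
import numpy as np


def box_to_grid_mask(box, image_size, grid_size):
    """Mark the grid cells overlapped by a pixel box (x0, y0, x1, y1).

    image_size is (width, height); grid_size is (rows, cols).
    Returns a flat boolean mask of length rows * cols.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    rows, cols = grid_size
    cell_w = width / cols
    cell_h = height / rows
    # First and one-past-last overlapped cell index, clamped to the grid.
    c0 = max(0, int(x0 // cell_w))
    c1 = min(cols, int(np.ceil(x1 / cell_w)))
    r0 = max(0, int(y0 // cell_h))
    r1 = min(rows, int(np.ceil(y1 / cell_h)))
    mask = np.zeros((rows, cols), dtype=bool)
    mask[r0:r1, c0:c1] = True
    return mask.ravel()


def forced_attention(mask):
    """Replace learned attention with uniform weights over the masked cells."""
    weights = mask.astype(float)
    total = weights.sum()
    if total == 0:
        raise ValueError("box does not overlap any grid cell")
    return weights / total


def context_vector(features, weights):
    """Attention-weighted sum of per-cell features, shape (feature_dim,)."""
    return weights @ features


# Toy example: a 4x4 grid over a 100x100 image with 3-dimensional features.
features = np.arange(16 * 3, dtype=float).reshape(16, 3)
mask = box_to_grid_mask((0, 0, 50, 50), (100, 100), (4, 4))
weights = forced_attention(mask)
context = context_vector(features, weights)
```

In a real decoder, `weights` would be substituted for the learned attention distribution at every generation step, so the decoder is conditioned only on features from the chosen area.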
