Image captioning with weakly-supervised attention penalty
Jiayun Li, Mohammad K. Ebrahimpour, Azadeh Moghtaderi, Yen-Yun Yu

TL;DR
This paper introduces a novel weakly-supervised attention alignment method for image captioning that improves caption quality by aligning predicted attention maps with intrinsic visual attentions without requiring bounding box annotations.
Contribution
It proposes a new model that explicitly aligns captioning attention with intrinsic visual attention, enhancing captioning performance without expensive annotations.
Findings
Improved language evaluation metrics on the COCO dataset
Enhanced captioning accuracy on a genealogical dataset
Effective attention alignment without bounding box supervision
Abstract
Stories are essential for genealogy research since they can help build emotional connections with people. Many family stories are preserved in historical photos and albums. Recent developments in image captioning models make it feasible to "tell stories" for photos automatically. The attention mechanism has been widely adopted in many state-of-the-art encoder-decoder image captioning models, since it can bridge the gap between the visual part and the language part. Most existing captioning models train their attention modules implicitly with a word-likelihood loss. Meanwhile, many studies have investigated intrinsic attentions for visual models using gradient-based approaches. Ideally, attention maps predicted by captioning models should be consistent with intrinsic attentions from visual models for any given visual concept. However, no work has been done to align implicitly…
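The idea of penalizing disagreement between a captioning attention map and an intrinsic (gradient-based) attention map can be illustrated with a small sketch. The abstract as shown here does not give the paper's actual penalty, so everything below is an assumption made for illustration: the function names `normalize_map`, `alignment_penalty`, and `total_loss`, the choice of a mean squared difference between normalized maps, and the weight `lam` are hypothetical and should not be read as the authors' formulation.

```python
def normalize_map(attn, eps=1e-8):
    """Flatten a 2-D map, clip negative values, and rescale so it sums to 1."""
    flat = [max(float(v), 0.0) for row in attn for v in row]
    total = sum(flat)
    if total < eps:
        # No positive mass: fall back to a uniform map.
        return [1.0 / len(flat)] * len(flat)
    return [v / total for v in flat]


def alignment_penalty(caption_attn, intrinsic_attn):
    """Mean squared difference between two normalized attention maps.

    Assumption: a simple stand-in for an alignment penalty, not the
    loss used in the paper.
    """
    p = normalize_map(caption_attn)
    q = normalize_map(intrinsic_attn)
    if len(p) != len(q):
        raise ValueError("attention maps must have the same number of cells")
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def total_loss(word_nll, caption_attn, intrinsic_attn, lam=1.0):
    """Word-likelihood loss plus a weighted alignment penalty (lam is hypothetical)."""
    return word_nll + lam * alignment_penalty(caption_attn, intrinsic_attn)


# Toy 2x2 maps: the captioning attention looks at the top row,
# while the intrinsic attention looks at the bottom row.
caption_attn = [[0.1, 0.9], [0.0, 0.0]]
intrinsic_attn = [[0.0, 0.0], [0.5, 0.5]]
loss = total_loss(2.0, caption_attn, intrinsic_attn, lam=0.5)
```

In this sketch the penalty is zero when the two maps agree up to scale and grows as their mass moves to different image regions, so the added term only affects training when the captioning attention drifts away from the intrinsic one.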
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
