Image captioning with weakly-supervised attention penalty

Jiayun Li; Mohammad K. Ebrahimpour; Azadeh Moghtaderi; Yen-Yun Yu

arXiv:1903.02507 · cs.CV · March 7, 2019 · 1 citation

TL;DR

This paper introduces a weakly-supervised attention alignment method for image captioning. It improves caption quality by aligning the attention maps predicted by the captioning model with the intrinsic attention of a visual model, without requiring bounding box annotations.

Contribution

It proposes a new model that explicitly aligns captioning attention with intrinsic visual attention, enhancing captioning performance without expensive annotations.
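
The idea of an attention penalty can be illustrated with a minimal sketch. It assumes the penalty is the mean squared difference between two normalized 2-D maps and that it is added to the word-likelihood loss with a scalar weight. This summary does not give the paper's actual formulation, so the function names, the squared-difference form, and the `weight` parameter are all hypothetical.

```python
# Illustrative sketch only: the penalty form and all names are assumptions,
# not the formulation used in the paper.

def normalize_map(grid, eps=1e-8):
    """Clip negatives to zero and rescale so the entries sum to about 1."""
    clipped = [[max(float(v), 0.0) for v in row] for row in grid]
    total = sum(sum(row) for row in clipped) + eps
    return [[v / total for v in row] for row in clipped]


def attention_penalty(predicted, intrinsic):
    """Mean squared difference between two normalized maps of equal shape."""
    p = normalize_map(predicted)
    q = normalize_map(intrinsic)
    diffs = [(a - b) ** 2 for pr, qr in zip(p, q) for a, b in zip(pr, qr)]
    return sum(diffs) / len(diffs)


def total_loss(word_nll, predicted, intrinsic, weight=1.0):
    """Word-likelihood loss plus the weighted attention penalty."""
    return word_nll + weight * attention_penalty(predicted, intrinsic)


# Captioning attention for one word vs. a gradient-based saliency map
# (toy 2x2 values chosen for illustration).
captioning_attention = [[0.7, 0.1], [0.1, 0.1]]
intrinsic_attention = [[0.1, 0.1], [0.1, 0.7]]
loss = total_loss(2.0, captioning_attention, intrinsic_attention, weight=0.5)
```

In this sketch the penalty is zero when the two maps agree and grows as the captioning attention drifts away from the reference map, so no bounding boxes are needed as a training signal.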

Findings

01 Improved language evaluation metrics on COCO dataset

02 Enhanced captioning accuracy on genealogical dataset

03 Effective attention alignment without bounding box supervision

Abstract

Stories are essential for genealogy research because they help build emotional connections with people. Many family stories are preserved in historical photos and albums. Recent developments in image captioning models make it feasible to "tell stories" for photos automatically. The attention mechanism has been widely adopted in state-of-the-art encoder-decoder image captioning models because it bridges the gap between the visual part and the language part. Most existing captioning models train their attention modules implicitly with a word-likelihood loss. Meanwhile, many studies have investigated intrinsic attention in visual models using gradient-based approaches. Ideally, attention maps predicted by captioning models should be consistent with intrinsic attention from visual models for any given visual concept. However, no work has been done to align implicitly…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques