Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

Rakshith Shetty; Marcus Rohrbach; Lisa Anne Hendricks; Mario Fritz; Bernt Schiele

arXiv:1703.10476·cs.CV·November 7, 2017·32 cites

TL;DR

This paper introduces an adversarial training approach for image captioning that produces diverse, less biased captions closely matching human descriptions, addressing limitations of current models in vocabulary, bias, and diversity.

Contribution

We propose a novel adversarial training method with Gumbel sampling to generate diverse, human-like captions, improving distribution matching and reducing bias.

Findings

01 Generated captions are more diverse and less biased.

02 Word statistics of generated captions better match human data.

03 Achieves comparable accuracy to state-of-the-art methods.

Abstract

While strong progress has been made in image captioning over the last years, machine and human captions are still quite distinct. A closer look reveals that this is due to the deficiencies in the generated word distribution, vocabulary size, and strong bias in the generators towards frequent captions. Furthermore, humans -- rightfully so -- generate multiple, diverse captions, due to the inherent ambiguity in the captioning task which is not considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing groundtruth captions to generating a set of captions that is indistinguishable from human generated captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our…
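The abstract mentions an approximate Gumbel sampler without giving details. As a rough illustration of the general idea, the sketch below implements the widely used Gumbel-softmax relaxation, which may or may not match the paper's exact formulation. It perturbs word logits with Gumbel noise and applies a temperature-scaled softmax, yielding a "soft" one-hot vector that remains differentiable in frameworks with automatic differentiation. The function name `gumbel_softmax_sample`, the `temperature` parameter, and the toy logits are assumptions made for this example; none of them come from the paper.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature=0.5, rng=random):
    """Return a relaxed one-hot sample over word indices (illustrative only)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        # Gumbel(0, 1) noise via inverse transform sampling.
        g = -math.log(-math.log(u))
        noisy.append((logit + g) / temperature)
    # Numerically stable softmax over the perturbed, scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-word vocabulary.
rng = random.Random(0)
word_logits = [2.0, 0.5, -1.0]
soft_one_hot = gumbel_softmax_sample(word_logits, temperature=0.5, rng=rng)
```

Lower temperatures push the output toward a hard one-hot vector, while higher temperatures smooth it toward uniform. The index of the largest entry is distributed according to the softmax of the original logits, which is why a generator trained this way can still be sampled like an ordinary word distribution.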

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis