Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz,, Bernt Schiele

TL;DR
This paper introduces an adversarial training approach for image captioning that produces diverse, less biased captions closely matching human descriptions, addressing limitations of current models in vocabulary, bias, and diversity.
Contribution
We propose a novel adversarial training method with Gumbel sampling to generate diverse, human-like captions, improving distribution matching and reducing bias.
Findings
Generated captions are more diverse and less biased.
Word statistics of generated captions better match human data.
Achieves comparable accuracy to state-of-the-art methods.
Abstract
While strong progress has been made in image captioning over the last years, machine and human captions are still quite distinct. A closer look reveals that this is due to the deficiencies in the generated word distribution, vocabulary size, and strong bias in the generators towards frequent captions. Furthermore, humans -- rightfully so -- generate multiple, diverse captions, due to the inherent ambiguity in the captioning task which is not considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing groundtruth captions to generating a set of captions that is indistinguishable from human generated captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis
