Cooperative image captioning
Gilad Vered, Gal Oren, Yuval Atzmon, Gal Chechik

TL;DR
This paper introduces PSST, a new training method for cooperative image captioning that improves the discriminative quality and naturalness of generated descriptions by addressing optimization challenges and constraining language to be human-like.
Contribution
The paper proposes PSST, a novel optimization technique for joint training of speaker and listener networks, and demonstrates how constraining descriptions to human language enhances naturalness and discriminativeness.
Findings
Recall@10 improved from 60% to 86% on COCO
Descriptions are more natural and discriminative
Method maintains language naturalness while improving task performance
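For readers unfamiliar with the Recall@10 figure above, here is a minimal sketch of how a Recall@K score is commonly computed for caption-based image retrieval. It assumes a listener assigns a score to every candidate image for each caption. The function name `recall_at_k`, the `scores`/`targets` layout, and the toy numbers are illustrative assumptions, not the paper's evaluation code or data.

```python
def recall_at_k(scores, targets, k=10):
    """Fraction of captions whose source image is among the k best-scored candidates.

    scores[i][j]: hypothetical listener score of caption i against candidate image j.
    targets[i]:   index of the image that caption i was generated from.
    """
    if not scores:
        raise ValueError("need at least one caption")
    hits = 0
    for row, target in zip(scores, targets):
        # Rank candidate indices by descending score and keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += target in top_k
    return hits / len(scores)


# Toy data (made up for illustration): 3 captions, 4 candidate images each.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # target image 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # target image 1 is ranked second
    [0.5, 0.6, 0.7, 0.1],  # target image 3 is ranked last
]
toy_targets = [0, 1, 3]

if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, recall_at_k(toy_scores, toy_targets, k=k))
```

Under this definition, a move from 60% to 86% means the correct image lands in the listener's top ten for 86 out of every 100 captions instead of 60.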
Abstract
When describing images with natural language, the descriptions can be made more informative if tuned using downstream tasks. This is often achieved by training two networks: a "speaker network" that generates sentences given an image, and a "listener network" that uses them to perform a task. Unfortunately, training multiple networks jointly to communicate and achieve a joint task faces two major challenges. First, the descriptions generated by a speaker network are discrete and stochastic, making optimization very hard and inefficient. Second, joint training usually causes the vocabulary used during communication to drift and diverge from natural language. We describe an approach that addresses both challenges. We first develop a new effective optimization based on partial-sampling from a multinomial distribution combined with straight-through gradient updates, which we name PSST for…
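The abstract names straight-through gradient updates as one ingredient of PSST. The sketch below illustrates only the generic straight-through idea for a single word choice: the forward pass emits a discrete one-hot sample, while the backward pass computes gradients as if the softmax probabilities had been emitted instead. It does not model the partial-sampling component, whose details are not given in the text above. The linear toy loss, `word_values`, and all function names are hypothetical, and the gradient is derived by hand so the example needs only the standard library.

```python
import math
import random


def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_one_hot(probs, rng):
    # Draw one index from the multinomial defined by probs.
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]


def straight_through_step(logits, word_values, rng):
    """Return (one_hot, loss, grad_logits) for one sampled word.

    Assumption: the downstream loss is linear in the emitted word vector,
    loss = sum_i word_values[i] * y[i], so dL/dy = word_values.
    """
    probs = softmax(logits)
    one_hot = sample_one_hot(probs, rng)
    # Forward pass uses the discrete sample.
    loss = sum(v * y for v, y in zip(word_values, one_hot))
    # Backward pass treats sampling as the identity and chains dL/dy
    # through the softmax Jacobian: grad_j = p_j * (v_j - sum_i p_i * v_i).
    expected = sum(p * v for p, v in zip(probs, word_values))
    grad_logits = [p * (v - expected) for p, v in zip(probs, word_values)]
    return one_hot, loss, grad_logits


if __name__ == "__main__":
    rng = random.Random(0)
    print(straight_through_step([0.0, 1.0, 2.0], [1.0, 0.0, -1.0], rng))
```

The resulting gradient is biased but deterministic given the logits, which is the usual trade-off of straight-through estimators compared with score-function (REINFORCE-style) estimators.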
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
