Towards Diverse and Natural Image Descriptions via a Conditional GAN
Bo Dai, Sanja Fidler, Raquel Urtasun, Dahua Lin

TL;DR
This paper introduces a Conditional GAN framework for image captioning that enhances the diversity and naturalness of generated descriptions, overcoming limitations of traditional likelihood-based models.
Contribution
It proposes a novel CGAN-based approach with reinforcement learning to generate more diverse and natural image descriptions, addressing the rigidity of existing methods.
Findings
Outperforms other methods on various tasks across two large datasets
Performs competitively against human-written captions in a user study
Produces more diverse and natural captions
Abstract
Despite the substantial progress in recent years, the image captioning techniques are still far from being perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also favor such restrictive methods. In this paper, we explore an alternative approach, with the aim to improve the naturalness and diversity -- two essential properties of human expression. Specifically, we propose a new framework based on Conditional Generative Adversarial Networks (CGAN), which jointly learns a generator to produce descriptions conditioned…
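The contrast the abstract draws between likelihood maximization and adversarial training can be written out as a rough sketch. The formulas below use the textbook forms of maximum-likelihood captioning, the conditional GAN minimax objective, and a score-function (REINFORCE-style) gradient. All symbols are assumptions introduced here for illustration (image $I$, caption $S = (w_1, \dots, w_T)$, noise $z$, generator $G_\theta$, evaluator $r_\eta$) and may not match the paper's exact notation or training procedure.

```latex
% Maximum likelihood: push probability mass onto the reference captions only.
\max_{\theta} \; \sum_{(I,S)} \log p_\theta(S \mid I)
  = \sum_{(I,S)} \sum_{t=1}^{T} \log p_\theta(w_t \mid I, w_{1:t-1})

% Conditional GAN (standard form): the evaluator r_\eta scores how well a
% caption fits the image; the generator G_\theta maps (I, z) to a caption.
\min_{\theta} \max_{\eta} \;
  \mathbb{E}_{S \sim P_I}\big[\log r_\eta(I, S)\big]
  + \mathbb{E}_{z \sim \mathcal{N}(0, \mathbf{I})}
    \big[\log\big(1 - r_\eta(I, G_\theta(I, z))\big)\big]

% Captions are discrete, so gradients cannot flow through sampled words.
% A generic score-function estimator treats the evaluator score as a reward:
\nabla_\theta \, \mathbb{E}_{S \sim \pi_\theta(\cdot \mid I, z)}\big[r_\eta(I, S)\big]
  = \mathbb{E}_{S \sim \pi_\theta(\cdot \mid I, z)}\Big[ r_\eta(I, S)
    \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(w_t \mid I, z, w_{1:t-1}) \Big]
```

Under the first objective, any caption that differs from the references is penalized even if it is reasonable. Under the second, the noise input $z$ lets a single image map to many captions, and the evaluator rewards any caption it judges to fit the image.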
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
