Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data
Parag Dutta, Ambedkar Dukkipati

TL;DR
This paper introduces LoGIC, a multi-agent reinforcement learning framework in which a speaker and a listener learn to communicate in natural language, improving image captioning performance without additional labeled data.
Contribution
It presents a communication game approach to unsupervised image captioning that works with both pre-trained models and lightweight components, and reports results that exceed existing unsupervised methods.
Findings
Achieved a BLEU score of 46 with pre-trained models, surpassing vanilla VLMs.
Obtained a BLEU score of 31 with lightweight components, outperforming existing unsupervised methods.
Demonstrated that emergent communication improves captioning without extra data.
Abstract
Image captioning is an important problem in developing various AI systems, and training captioning models requires large volumes of annotated images. Since all existing labelled datasets are already used for training large Vision Language Models (VLMs), it becomes challenging to improve their performance further. It is therefore essential to consider unsupervised image captioning, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a Multi-agent Reinforcement Learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', with the objective of learning a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a…
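The abstract names GRPO and a common-reward setting but, as excerpted here, does not spell out the update. As a rough illustration only, the sketch below shows the group-relative advantage normalization generally associated with GRPO: several captions are sampled for one image, each receives a reward, and each reward is standardized against the group's mean and standard deviation. The function name, the `eps` constant, and the 0/1 reward values (standing in for whether a listener identified the right image) are assumptions made for this example, not details taken from the paper.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled captions.

    Hypothetical helper: each reward is compared against the mean and
    standard deviation of its own group, so no learned value function
    is needed to form a baseline.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]


# Assumed common rewards for four captions sampled for the same image:
# 1.0 if the listener picked the correct image, 0.0 otherwise.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

# In a cooperative common-reward game, speaker and listener would both
# be updated with the same per-sample advantage.
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Captions that led to a correct guess receive a positive advantage and the others a negative one; if every caption in the group earns the same reward, all advantages are zero and that group contributes no learning signal.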
