TL;DR
This paper introduces CgT-GAN, a novel image captioning model that leverages CLIP guidance and adversarial training to generate more natural and semantically accurate captions without relying on human-annotated datasets.
Contribution
The paper proposes CgT-GAN, which brings images into the training process and rewards the caption generator with both an adversarial signal and a CLIP-based semantic signal, improving caption quality without human-annotated image-caption pairs.
Findings
Outperforms state-of-the-art methods on three captioning subtasks.
Uses novel CLIP-agg semantic guidance for better caption alignment.
Demonstrates significant improvements in caption naturalness and semantic accuracy.
Abstract
The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent CLIP-based approaches to image captioning without human annotations follow a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or by huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" the real visual modality. Specifically, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded…
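The abstract says the caption generator is rewarded jointly by an adversarial signal and a CLIP-based semantic signal. The following is a minimal Python sketch of how two such signals could be combined, not the authors' implementation. The function names, the additive weighting (`semantic_weight`), the use of plain cosine similarity as the semantic reward, and the mean-reward baseline are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def joint_reward(naturalness, image_emb, caption_emb, semantic_weight=1.0):
    """Combine two reward signals for one sampled caption.

    naturalness: hypothetical discriminator score for the caption, e.g. the
        probability that it came from the external text corpus.
    image_emb, caption_emb: hypothetical embeddings of the image and the
        caption in a shared space (stand-ins for CLIP features).
    semantic_weight: assumed trade-off coefficient, not taken from the paper.
    """
    semantic = cosine_similarity(image_emb, caption_emb)
    return naturalness + semantic_weight * semantic


def baseline_adjusted(rewards):
    """Subtract the mean reward, a common variance-reduction baseline
    for policy-gradient style updates (assumed here)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]
```

In a policy-gradient setup, each adjusted reward would scale the log-probability of its sampled caption, so captions that are both corpus-like and close to the image in the embedding space are reinforced.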
