CgT-GAN: CLIP-guided Text GAN for Image Captioning

Jiarui Yu; Haoran Li; Yanbin Hao; Bin Zhu; Tong Xu; Xiangnan He

arXiv:2308.12045 · cs.CV · August 24, 2023

TL;DR

This paper introduces CgT-GAN, a novel image captioning model that leverages CLIP guidance and adversarial training to generate more natural and semantically accurate captions without relying on human-annotated datasets.

Contribution

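The paper proposes CgT-GAN, which brings images into the training process and rewards the caption generator with two signals: an adversarial signal that pushes captions toward the phrasing of an external text corpus, and a CLIP-based semantic signal that ties captions to image content. Together these improve caption quality without human-annotated image-caption pairs.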

Findings

01 Outperforms state-of-the-art methods on three captioning subtasks.

02 Uses novel CLIP-agg semantic guidance for better caption alignment.

03 Demonstrates significant improvements in caption naturalness and semantic accuracy.

Abstract

The large-scale visual-language pre-trained model, Contrastive Language-Image Pre-training (CLIP), has significantly improved image captioning for scenarios without human-annotated image-caption pairs. Recent advanced CLIP-based image captioning without human annotations follows a text-only training paradigm, i.e., reconstructing text from a shared embedding space. Nevertheless, these approaches are limited by the training/inference gap or huge storage requirements for text embeddings. Given that it is trivial to obtain images in the real world, we propose CLIP-guided text GAN (CgT-GAN), which incorporates images into the training process to enable the model to "see" the real visual modality. In particular, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and a CLIP-based reward to provide semantic guidance. The caption generator is jointly rewarded…
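
The idea of jointly rewarding a caption generator can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the semantic reward is a cosine similarity between an image embedding and a caption embedding, and that a discriminator supplies a naturalness score in [0, 1]. The function names, the two weights, and the mean baseline are hypothetical choices made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def joint_reward(image_emb, caption_emb, disc_score,
                 semantic_weight=1.0, naturalness_weight=1.0):
    """Combine a semantic reward with a naturalness reward.

    Assumption: a simple weighted sum. The source text does not state
    how the two rewards are actually combined.
    """
    semantic = cosine_similarity(image_emb, caption_emb)
    return semantic_weight * semantic + naturalness_weight * disc_score


def advantages(rewards):
    """Subtract the mean reward as a baseline (hypothetical variance-reduction choice)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Toy embeddings standing in for CLIP image/text features.
    image_emb = [1.0, 0.0]
    sampled = [
        ([0.9, 0.1], 0.8),  # on-topic caption that also reads naturally
        ([0.1, 0.9], 0.8),  # fluent but off-topic caption
    ]
    rewards = [joint_reward(image_emb, emb, score) for emb, score in sampled]
    print(advantages(rewards))
```

In a policy-gradient setup, each advantage would scale the log-probability of its sampled caption, so captions that are both on-topic and fluent are reinforced relative to the rest of the batch.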

Code & Models

Repositories

lihr747/cgtgan
PyTorch · Official