Unleashing Text-to-Image Diffusion Prior for Zero-Shot Image Captioning
Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao

TL;DR
This paper introduces PCM-Net, a zero-shot image captioning model trained on synthetic image-caption pairs. It uses a patch-wise cross-modal feature mix-up mechanism and a CLIP-weighted loss to reduce semantic misalignment between the synthetic images and their captions, and it reports state-of-the-art results.
Contribution
The paper proposes PCM-Net, a new framework that adaptively fuses visual and textual features at the patch level and employs a CLIP-weighted loss for better zero-shot captioning performance.
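The summary above does not say how the CLIP-weighted loss is computed. As a rough illustration only, one plausible reading is that each synthetic image-caption pair contributes to the captioning loss in proportion to its CLIP image-text similarity, so poorly aligned pairs count for less. The function name `clip_weighted_loss`, its inputs, and the normalisation by the sum of weights are all assumptions made for this sketch, not the paper's formulation.

```python
import math


def clip_weighted_loss(token_log_probs, clip_scores):
    """Weighted mean of per-caption negative log-likelihoods (hypothetical form).

    token_log_probs: one list per caption, holding the log-probability the
                     decoder assigned to each ground-truth token
    clip_scores:     one non-negative CLIP image-text similarity per caption,
                     used here directly as the sample weight (an assumption)
    """
    total_weight = sum(clip_scores)
    if total_weight <= 0:
        raise ValueError("clip_scores must sum to a positive value")

    loss = 0.0
    for log_probs, weight in zip(token_log_probs, clip_scores):
        # Length-normalised negative log-likelihood of one caption.
        nll = -sum(log_probs) / len(log_probs)
        loss += weight * nll
    return loss / total_weight


# Toy batch: the first pair is well aligned (weight 0.75), the second is not.
log_probs = [[math.log(0.5), math.log(0.5)], [math.log(0.25)]]
scores = [0.75, 0.25]
loss = clip_weighted_loss(log_probs, scores)
```

With these toy numbers the poorly aligned second caption has the larger per-caption loss but only a quarter of the total weight, which is the down-weighting effect the sketch is meant to show.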
Findings
PCM-Net outperforms existing methods on MSCOCO and Flickr30k.
PCM-Net ranks first in both in-domain and cross-domain zero-shot captioning.
The results demonstrate the effectiveness of the patch-wise feature mix-up and the CLIP-weighted loss.
Abstract
Recently, zero-shot image captioning has gained increasing attention, where only text data is available for training. The remarkable progress in text-to-image diffusion models presents the potential to resolve this task by employing synthetic image-caption pairs generated by this pre-trained prior. Nonetheless, the defective details in the salient regions of the synthetic images introduce semantic misalignment between the synthetic image and text, leading to compromised results. To address this challenge, we propose a novel Patch-wise Cross-modal feature Mix-up (PCM) mechanism to adaptively mitigate the unfaithful contents in a fine-grained manner during training, which can be integrated into most encoder-decoder frameworks, introducing our PCM-Net. Specifically, for each input image, salient visual concepts in the image are first detected considering the image-text similarity in CLIP…
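The abstract is cut off before it explains how the mix-up is applied, so the following is only a minimal sketch of the general idea: patches of a synthetic image that match a detected text concept have their visual features blended with that concept's text feature. The function `patch_mixup`, the fixed `threshold`, and the fixed mixing ratio `lam` are hypothetical choices for illustration; the paper describes the mechanism as adaptive, and its actual rule is not given in this excerpt. The vectors stand in for CLIP patch and text embeddings in a shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def patch_mixup(patch_feats, concept_feats, threshold=0.5, lam=0.5):
    """Blend each salient patch with its best-matching text concept.

    patch_feats:   patch feature vectors from the synthetic image
    concept_feats: non-empty list of text-concept embeddings in the same space
    threshold:     assumed similarity cut-off for treating a patch as salient
    lam:           assumed share of the text feature in the blended result
    """
    mixed = []
    for patch in patch_feats:
        sims = [cosine(patch, concept) for concept in concept_feats]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            concept = concept_feats[best]
            # Convex combination of the visual and textual features.
            mixed.append([(1 - lam) * p + lam * c for p, c in zip(patch, concept)])
        else:
            # Patches that match no concept are left untouched.
            mixed.append(list(patch))
    return mixed


# Three toy 2-D "patches" and one toy "concept" embedding.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
concepts = [[2.0, 0.0]]
mixed = patch_mixup(patches, concepts, threshold=0.5, lam=0.5)
```

In this toy run only the first patch points in the same direction as the concept, so it alone is pulled toward the text feature; the other two pass through unchanged.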
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
Methods: Diffusion · Contrastive Language-Image Pre-training
