Controllable Image Captioning via Prompting
Ning Wang, Jiahao Xie, Jihao Wu, Mingbo Jia, Linlin Li

TL;DR
This paper introduces a prompt-based method for controllable image captioning, enabling a single model to generate stylized captions across different domains without degrading performance in any individual domain.
Contribution
It proposes a prompt learning framework with learnable vectors for flexible, multi-style caption generation in a unified model, surpassing heuristic prompt engineering.
Findings
Achieves controllable captioning with diverse styles.
Performs well on COCO and TextCaps benchmarks.
Maintains high performance across multiple domains.
Abstract
Despite the remarkable progress of image captioning, existing captioners typically lack the controllable capability to generate desired image captions, e.g., describing the image in a rough or detailed manner, in a factual or emotional view, etc. In this paper, we show that a unified model is qualified to perform well in diverse domains and freely switch among multiple styles. Such a controllable capability is achieved by embedding the prompt learning into the image captioning framework. To be specific, we design a set of prompts to fine-tune the pre-trained image captioner. These prompts allow the model to absorb stylized data from different domains for joint training, without performance degradation in each domain. Furthermore, we optimize the prompts with learnable vectors in the continuous word embedding space, avoiding the heuristic prompt engineering and meanwhile exhibiting…
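The idea of learnable prompt vectors in the continuous word embedding space can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the style names, the embedding width, the prompt length, and the helper names `init_prompts` and `build_decoder_input` are all hypothetical, and no training step is shown. Each style owns a small set of vectors that would be optimized during fine-tuning, and switching style at inference only changes which set is prepended to the caption token embeddings.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
EMBED_DIM = 8      # hypothetical word-embedding width
PROMPT_LEN = 4     # hypothetical number of learnable vectors per style
STYLES = ["brief", "detailed", "factual", "emotional"]  # hypothetical style names


def init_prompts(styles, prompt_len, dim, seed=0):
    """Create one small set of prompt vectors per style.

    In a real system these would be trainable parameters updated by
    gradient descent; here they are only randomly initialised.
    """
    rng = random.Random(seed)
    return {
        style: [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                for _ in range(prompt_len)]
        for style in styles
    }


def build_decoder_input(prompts, style, token_embeddings):
    """Prepend the chosen style's prompt vectors to the caption embeddings."""
    if style not in prompts:
        raise KeyError(f"unknown style: {style!r}")
    prefix = [list(vec) for vec in prompts[style]]
    body = [list(tok) for tok in token_embeddings]
    return prefix + body


prompts = init_prompts(STYLES, PROMPT_LEN, EMBED_DIM)

# Stand-in for the embeddings of a three-token caption.
caption_embeddings = [[0.0] * EMBED_DIM for _ in range(3)]

# The same caption embeddings, conditioned on two different styles.
seq_detailed = build_decoder_input(prompts, "detailed", caption_embeddings)
seq_emotional = build_decoder_input(prompts, "emotional", caption_embeddings)
```

Because only the prefix differs between `seq_detailed` and `seq_emotional`, a single set of captioner weights can serve every style, which is the property the abstract attributes to joint training with prompts.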
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
