Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences
Dingyi Yang, Hongyu Chen, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Qin Jin

TL;DR
This paper introduces FS-StyleCap, a framework for few-shot stylized visual captioning that generates stylized image and video descriptions from only a few example sentences supplied at inference, without additional training.
Contribution
The paper presents a new few-shot approach for stylized captioning that does not require style-labeled datasets or retraining, enabling flexible style generation from limited examples.
Findings
Outperforms state-of-the-art approaches on automatic evaluation for few-shot sentimental visual captioning
Achieves results comparable to models fully trained on labeled style corpora
Human evaluations confirm the model's ability to handle multiple styles
Abstract
Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media
