Visual Captioning at Will: Describing Images and Videos Guided by a Few Stylized Sentences

Dingyi Yang; Hongyu Chen; Xinglin Hou; Tiezheng Ge; Yuning Jiang; Qin Jin

arXiv:2307.16399·cs.MM·August 1, 2023

Open Access

TL;DR

This paper introduces FS-StyleCap, a framework for few-shot stylized visual captioning that generates stylized image and video descriptions from only a few style examples supplied at inference, without additional training.

Contribution

The paper presents a new few-shot approach for stylized captioning that does not require style-labeled datasets or retraining, enabling flexible style generation from limited examples.

Findings

01. Outperforms state-of-the-art approaches on automatic evaluation of sentimental captioning

02. Achieves results comparable to fully supervised models

03. Human evaluations confirm its ability to handle multiple styles

Abstract

Stylized visual captioning aims to generate image or video descriptions with specific styles, making them more attractive and emotionally appropriate. One major challenge with this task is the lack of paired stylized captions for visual content, so most existing works focus on unsupervised methods that do not rely on parallel datasets. However, these approaches still require training with sufficient examples that have style labels, and the generated captions are limited to predefined styles. To address these limitations, we explore the problem of Few-Shot Stylized Visual Captioning, which aims to generate captions in any desired style, using only a few examples as guidance during inference, without requiring further training. We propose a framework called FS-StyleCap for this task, which utilizes a conditional encoder-decoder language model and a visual projection module. Our two-step…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Subtitles and Audiovisual Media
