Diverse Image Captioning with Grounded Style

Franz Klein; Shweta Mahajan; Stefan Roth

arXiv:2205.01813 · cs.CV · May 5, 2022

TL;DR

This paper introduces a novel approach for stylized image captioning that incorporates visual scene content into style generation, using attribute-based data augmentation and a variational autoencoder to produce diverse, grounded captions.

Contribution

The paper combines COCO attribute-based caption augmentation with a Variational Autoencoder whose sequential latent space is structured by extracted image attributes, so that generated captions are diverse and their style is grounded in the visual scene.

Findings

01 Effective in generating diverse stylized captions

02 Captions are accurately grounded in visual content

03 Improves over prior style-only captioning methods

Abstract

Stylized image captioning as presented in prior work aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption, e.g. positive or negative, however without taking into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the…
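The attribute-based augmentation idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' procedure, which this page does not describe in detail. It assumes a hypothetical mapping from object nouns to attribute adjectives (standing in for COCO attribute annotations) and a hypothetical helper, `augment_caption`, that inserts one attribute before each known object noun to produce several caption variants.

```python
from itertools import product


def augment_caption(caption, object_attributes):
    """Return caption variants with an attribute placed before each known object noun.

    `object_attributes` is a hypothetical dict mapping a lowercase noun to a
    list of attribute adjectives; it is an assumption made for illustration.
    """
    options = []
    for token in caption.split():
        attrs = object_attributes.get(token.lower(), [])
        # Known object: one option per attribute. Unknown token: keep as-is.
        options.append([f"{a} {token}" for a in attrs] if attrs else [token])
    # Every combination of per-token options yields one caption variant.
    return [" ".join(choice) for choice in product(*options)]


variants = augment_caption(
    "a dog sits on a bench",
    {"dog": ["happy", "lonely"], "bench": ["wooden"]},
)
for v in variants:
    print(v)
```

With the toy mapping above, the single factual caption yields two variants, one per attribute of "dog". A realistic pipeline would also need to handle articles, plurals, and which attributes actually apply to the objects in a given image.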

Code & Models

Repositories

visinf/style-seqcvae (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Image Retrieval and Classification Techniques