Diverse Image Captioning with Grounded Style
Franz Klein, Shweta Mahajan, Stefan Roth

TL;DR
This paper introduces a novel approach for stylized image captioning that incorporates visual scene content into style generation, using attribute-based data augmentation and a variational autoencoder to produce diverse, grounded captions.
Contribution
It proposes a new method combining attribute-based augmentation and a structured variational autoencoder to generate diverse, style-grounded image captions.
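As a rough illustration of what a latent space structured by image attributes can mean, the following is a generic evidence lower bound for a sequential conditional VAE. It is a sketch under stated assumptions, not the paper's exact objective: the symbols are illustrative, with x the caption of length T, I the image, a the extracted attributes, and z_t the per-word latent variables. The attribute-conditioned prior is the assumed mechanism through which attributes shape the latent space.

```latex
\log p_\theta(x \mid I, a) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x, I)} \Bigg[ \sum_{t=1}^{T}
  \log p_\theta\big(x_t \mid x_{<t}, z_{\le t}, I\big)
  - D_{\mathrm{KL}}\Big( q_\phi\big(z_t \mid z_{<t}, x, I\big) \,\Big\|\,
      p_\theta\big(z_t \mid z_{<t}, x_{<t}, a, I\big) \Big) \Bigg]
```

The first term rewards reconstructing each word, and the KL term pulls the per-word posterior toward a prior that depends on the attributes, so sampling from that prior at test time can yield captions whose style varies with the attributes detected in the image.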
Findings
Effective in generating diverse stylized captions
Captions are accurately grounded in visual content
Improves over prior style-only captioning methods
Abstract
Stylized image captioning as presented in prior work aims to generate captions that reflect characteristics beyond a factual description of the scene composition, such as sentiments. Such prior work relies on given sentiment identifiers, which are used to express a certain global style in the caption (e.g., positive or negative), without taking into account the stylistic content of the visual scene. To address this shortcoming, we first analyze the limitations of current stylized captioning datasets and propose COCO attribute-based augmentations to obtain varied stylized captions from COCO annotations. Furthermore, we encode the stylized information in the latent space of a Variational Autoencoder; specifically, we leverage extracted image attributes to explicitly structure its sequential latent space according to different localized style characteristics. Our experiments on the…
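The attribute-based augmentation mentioned in the abstract can be pictured with a minimal sketch. The abstract does not spell out the actual procedure, so everything here is an assumption for illustration: the function name `augment_caption`, the whitespace tokenization, and the hand-written attribute dictionary are hypothetical and do not reproduce the authors' pipeline or the COCO Attributes format.

```python
from typing import Dict, List


def augment_caption(caption: str, object_attributes: Dict[str, List[str]]) -> List[str]:
    """Return caption variants with one attribute inserted before a matching object word.

    Hypothetical helper: `object_attributes` maps an object noun to attribute
    adjectives assumed to have been annotated for that object in the image.
    """
    tokens = caption.split()
    variants = []
    for i, token in enumerate(tokens):
        noun = token.lower().strip(".,")
        for attribute in object_attributes.get(noun, []):
            # Skip the variant if the attribute already precedes the noun.
            if i > 0 and tokens[i - 1].lower() == attribute:
                continue
            variants.append(" ".join(tokens[:i] + [attribute] + tokens[i:]))
    return variants


# Illustrative attributes for a single image (made-up values).
attributes = {"dog": ["happy", "lonely"], "bench": ["wooden"]}
stylized = augment_caption("a dog sits on a bench", attributes)
for variant in stylized:
    print(variant)
```

Each variant attaches a style word to a specific object rather than to the whole sentence, which is the sense in which style is grounded in the scene. A real pipeline would also need to handle articles, plurals, and multi-word object names.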
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Image Retrieval and Classification Techniques
