SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text
Alexander Mathews, Lexing Xie, Xuming He

TL;DR
SemStyle generates visually relevant, styled image captions by separating semantics from style. It learns the style from a large corpus of styled text that is not aligned to images.
Contribution
It introduces an approach to stylised caption generation that does not require styled captions aligned to images, relying on a concise semantic term representation and a unified language model.
Findings
Generated captions preserve the semantics of the image while shifting to the target style.
Automatic and manual evaluations confirm both relevance and style adaptation.
The model leverages large corpora of styled text that are not aligned to images.
Abstract
Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. Existing approaches either require styled training captions aligned to images or generate captions with low relevance. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images. The core idea of this model, called SemStyle, is to separate semantics and style. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics. In addition, we develop a unified language model that decodes sentences with diverse word choices and syntax for different…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Language, Metaphor, and Cognition
