Controllable Image Captioning

Luka Maxwell

arXiv:2204.13324·cs.CV·May 26, 2022

TL;DR

This paper introduces a controllable image captioning framework that generates diverse, high-quality descriptions by leveraging Part-Of-Speech tags and a Transformer network, enhancing interpretability and user control.

Contribution

It presents a novel method that decouples the direct dependence between successive variables, so the decoder can search over Part-Of-Speech (POS) choices efficiently and generate controllable, diverse captions with a Transformer architecture.

Findings

01. Outperforms state-of-the-art in diversity and quality of captions

02. Allows user control through POS sequences

03. Maintains decoding speed proportional to POS vocabulary size

Abstract

State-of-the-art image captioners can generate accurate sentences to describe images in a sequence-to-sequence manner, but without considering controllability or interpretability. This falls short of making image captioning widely usable, as an image can be interpreted in infinitely many ways depending on the target and the context at hand. Achieving controllability is especially important when the image captioner is used by different people with different ways of interpreting the images. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics. Our model decouples direct dependence between successive variables. In this way, it allows the decoder to exhaustively search through the latent Part-Of-Speech choices, while keeping decoding speed proportional to the size of…
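
The idea of searching over Part-Of-Speech choices at each step, or fixing them to steer the caption, can be illustrated with a toy sketch. This is not the paper's implementation: the score table `TOY_SCORES`, the function `decode`, and the tag names are all hypothetical, and hand-written scores stand in for a trained Transformer conditioned on an image. The sketch assumes each position is scored independently of the others, so the loop over candidate tags at each step grows linearly with the number of POS tags.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy scores: one entry per caption position,
# mapping POS tag -> {word: score}. A real captioner would
# compute these from image features; here they are hand-written.
TOY_SCORES: List[Dict[str, Dict[str, float]]] = [
    {"DET": {"a": 0.9, "the": 0.8}, "NOUN": {"dog": 0.4}},
    {"NOUN": {"dog": 0.9, "park": 0.3}, "ADJ": {"brown": 0.7}},
    {"VERB": {"runs": 0.8, "sits": 0.5}, "NOUN": {"dog": 0.6}},
]


def decode(
    scores: List[Dict[str, Dict[str, float]]],
    pos_sequence: Optional[List[str]] = None,
) -> List[Tuple[str, str]]:
    """Pick one (tag, word) pair per position.

    If pos_sequence is given, the tag at each position is fixed by the user.
    Otherwise every tag available at that position is tried and the
    best-scoring (tag, word) pair is kept.
    """
    if pos_sequence is not None and len(pos_sequence) != len(scores):
        raise ValueError("pos_sequence must have one tag per position")

    caption: List[Tuple[str, str]] = []
    for step, by_tag in enumerate(scores):
        # User control: restrict the search to the requested tag.
        tags = [pos_sequence[step]] if pos_sequence is not None else list(by_tag)
        best: Optional[Tuple[str, str, float]] = None
        for tag in tags:  # one pass over the candidate POS tags
            words = by_tag.get(tag, {})
            if not words:
                continue
            word, score = max(words.items(), key=lambda kv: kv[1])
            if best is None or score > best[2]:
                best = (tag, word, score)
        if best is None:
            raise ValueError(f"no word available for position {step}")
        caption.append((best[0], best[1]))
    return caption


free = decode(TOY_SCORES)
controlled = decode(TOY_SCORES, ["DET", "ADJ", "NOUN"])
print(" ".join(word for _, word in free))        # a dog runs
print(" ".join(word for _, word in controlled))  # a brown dog
```

With no POS sequence the sketch returns the best-scoring pair at each position; supplying `["DET", "ADJ", "NOUN"]` yields a different caption from the same scores, which is the kind of user control the summary above describes.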

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Byte Pair Encoding · Dense Connections · Residual Connection · Dropout · Adam