
TL;DR
This paper introduces a controllable image captioning framework that generates diverse, high-quality descriptions by leveraging Part-Of-Speech tags and a Transformer network, enhancing interpretability and user control.
Contribution
It presents a method that decouples the direct dependence between successive latent variables, enabling controllable and diverse caption generation with efficient decoding using a Transformer architecture.
Findings
Outperforms state-of-the-art methods in caption diversity and quality
Allows user control through POS sequences
Maintains decoding speed proportional to POS vocabulary size
Abstract
State-of-the-art image captioners can generate accurate sentences to describe images in a sequence-to-sequence manner without considering controllability and interpretability. This, however, is far from making image captioning widely used, as an image can be interpreted in infinite ways depending on the target and the context at hand. Achieving controllability is especially important when the image captioner is used by different people with different ways of interpreting the images. In this paper, we introduce a novel framework for image captioning which can generate diverse descriptions by capturing the co-dependence between Part-Of-Speech tags and semantics. Our model decouples direct dependence between successive variables. In this way, it allows the decoder to exhaustively search through the latent Part-Of-Speech choices, while keeping decoding speed proportional to the size of…
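As a rough illustration of the idea in the abstract, the toy sketch below shows why decoupling successive choices keeps the search cheap: each step scans the POS vocabulary once instead of enumerating whole tag sequences. It is not the paper's implementation. The score tables, the tag set, and the `decode` function are all hypothetical, and a real model would compute the scores from image features with a Transformer.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy scores (assumptions, not from the paper).
# TAG_LOGP[t][tag] stands in for log P(tag at step t | image).
TAG_LOGP: List[Dict[str, float]] = [
    {"DET": -0.1, "NOUN": -2.5, "VERB": -4.0},
    {"DET": -3.0, "NOUN": -0.2, "VERB": -2.0},
    {"DET": -3.5, "NOUN": -1.5, "VERB": -0.3},
]
# WORD_LOGP[tag][word] stands in for log P(word | tag); in this toy
# it does not depend on the step or on earlier words.
WORD_LOGP: Dict[str, Dict[str, float]] = {
    "DET": {"a": -0.2, "the": -1.8},
    "NOUN": {"dog": -0.4, "ball": -1.2},
    "VERB": {"runs": -0.5, "jumps": -1.0},
}


def decode(pos_sequence: Optional[List[str]] = None) -> List[Tuple[str, str]]:
    """Pick a (tag, word) pair per step.

    If pos_sequence is given, the tag at each step is fixed by the user
    (controllable mode). Otherwise every tag in the POS vocabulary is
    tried at each step, so the work per step grows linearly with the
    number of tags.
    """
    caption: List[Tuple[str, str]] = []
    for step, tag_scores in enumerate(TAG_LOGP):
        candidates = [pos_sequence[step]] if pos_sequence else list(tag_scores)
        best: Optional[Tuple[float, str, str]] = None
        for tag in candidates:
            # Best word under this tag, then the joint tag + word score.
            word, word_lp = max(WORD_LOGP[tag].items(), key=lambda kv: kv[1])
            score = tag_scores[tag] + word_lp
            if best is None or score > best[0]:
                best = (score, tag, word)
        assert best is not None
        caption.append((best[1], best[2]))
    return caption


free = decode()                                # exhaustive search over tags
controlled = decode(["NOUN", "VERB", "DET"])   # user-specified POS sequence
```

Calling `decode()` searches all tags at every step, while passing a POS sequence forces the tag at each position and only the word is chosen, which mirrors the user-control use case described in the findings.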
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Absolute Position Encodings · Byte Pair Encoding · Dense Connections · Residual Connection · Dropout · Adam
