TL;DR
This paper introduces a length level embedding for image captioning, enabling controllable caption length, and proposes a non-autoregressive model that improves efficiency and diversity, achieving state-of-the-art results on MS COCO.
Contribution
The paper presents a simple length level embedding that lets a captioning model control the length of its output, and a non-autoregressive captioning model that improves decoding efficiency and caption diversity.
Findings
Achieves state-of-the-art performance on MS COCO
Generates controllable and diverse captions
Significantly improves decoding efficiency for long captions
Abstract
The last decade has witnessed remarkable progress in the image captioning task; however, most existing methods cannot control their captions, e.g., choosing to describe the image either roughly or in detail. In this paper, we propose to use a simple length level embedding to endow them with this ability. Moreover, due to their autoregressive nature, the computational complexity of existing models increases linearly as the length of the generated captions grows. Thus, we further devise a non-autoregressive image captioning approach that can generate captions in a length-irrelevant complexity. We verify the merit of the proposed length level embedding on three models: two state-of-the-art (SOTA) autoregressive models with different types of decoder, as well as our proposed non-autoregressive model, to show its generalization ability. In the experiments, our length-controllable…
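To make the idea of a length level embedding concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The bucket boundaries in `LENGTH_LEVELS`, the helper names, and the choice to sum the level vector with each token embedding are all assumptions made for illustration; the abstract does not specify these details.

```python
import random

# Hypothetical length buckets (inclusive word-count ranges). The boundaries
# are illustrative assumptions, not values taken from the paper.
LENGTH_LEVELS = [(1, 9), (10, 14), (15, 19), (20, 25)]


def length_to_level(num_tokens):
    """Return the index of the bucket that contains num_tokens."""
    for level, (low, high) in enumerate(LENGTH_LEVELS):
        if low <= num_tokens <= high:
            return level
    raise ValueError(f"length {num_tokens} is outside all configured levels")


def make_table(num_rows, dim, rng):
    """Create a randomly initialised embedding table (num_rows x dim)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]


def embed_with_length_level(token_ids, level, token_table, level_table):
    """Add the same length-level vector to every token embedding.

    Summation is an assumption here; it is one common way to inject a
    conditioning signal into a decoder's input embeddings.
    """
    level_vec = level_table[level]
    return [
        [t + l for t, l in zip(token_table[tok], level_vec)]
        for tok in token_ids
    ]


rng = random.Random(0)
DIM = 4
token_table = make_table(num_rows=100, dim=DIM, rng=rng)
level_table = make_table(num_rows=len(LENGTH_LEVELS), dim=DIM, rng=rng)

tokens = [5, 17, 42]  # made-up token ids
short_inputs = embed_with_length_level(tokens, 0, token_table, level_table)
long_inputs = embed_with_length_level(tokens, 3, token_table, level_table)
```

In this sketch the same tokens produce different decoder inputs depending on the requested level, which is the signal a model could learn to use when deciding whether to describe the image roughly or in detail. At training time the level would plausibly come from `length_to_level` applied to the ground-truth caption; at inference time the user would choose it.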
