Reflective Decoding Network for Image Captioning

Lei Ke; Wenjie Pei; Ruiyu Li; Xiaoyong Shen; Yu-Wing Tai

arXiv:1908.11824 · cs.CV · September 2, 2019 · 6 cites

Open Access

TL;DR

This paper introduces the Reflective Decoding Network (RDN), a novel image captioning model that enhances language coherence and positional awareness to generate more accurate and contextually rich captions, especially in complex scenes.

Contribution

The paper proposes RDN, which improves caption quality by jointly modeling visual features, language coherence, and word position, advancing beyond existing methods focused mainly on visual features.

Findings

01

RDN outperforms previous methods on the COCO dataset.

02

The approach is especially effective for complex scene descriptions.

03

Enhanced long-sequence dependency modeling improves caption coherence.

Abstract

State-of-the-art image captioning methods mostly focus on improving visual features; less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and the syntactic paradigm of sentences are also important for generating high-quality image captions. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency and position perception of words in a caption decoder. Our model learns to collaboratively attend on both visual and textual features and meanwhile perceive each word's relative position in the sentence to maximize the information delivered in the generated caption. We evaluate the effectiveness of our RDN on the COCO image captioning datasets and achieve superior…
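
The abstract names two ideas without giving equations: attending over textual features to capture long-sequence dependency, and perceiving each word's relative position. The sketch below is only an illustration of those two ideas under stated assumptions, not the paper's formulation. It assumes the textual features are the decoder's past hidden states, swaps in plain scaled dot-product attention for whatever scoring function RDN actually uses, and assumes relative position means the word index divided by the caption length. The names `attend_over_history` and `relative_position` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_over_history(history, query):
    """Scaled dot-product attention of `query` over past hidden states.

    Assumption: `history` stands in for the decoder's earlier hidden
    states (the "textual features"); the scoring function is a
    simplification chosen for illustration.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) / scale
              for h in history]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the past hidden states.
    context = [sum(w * h[d] for w, h in zip(weights, history))
               for d in range(dim)]
    return context, weights


def relative_position(t, length):
    """Assumed definition: 0-based step t mapped into (0, 1]."""
    return (t + 1) / length


# Toy example: three past hidden states of dimension 2.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.0]
context, weights = attend_over_history(history, query)
```

In this toy example the first and third hidden states align with the query equally, so they receive equal weight and the second receives less. A real decoder would learn the projections that produce `query` and `history` rather than use raw vectors.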

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition