Recurrent Relational Memory Network for Unsupervised Image Captioning

Dan Guo; Yang Wang; Peipei Song; Meng Wang

arXiv:2006.13611·cs.CV·June 25, 2020

Open Access

TL;DR

This paper introduces R^2M, a memory-based network for unsupervised image captioning that avoids GANs, uses a two-stage memory mechanism for relational reasoning, and outperforms existing methods on benchmarks.

Contribution

The paper presents a novel R^2M network that replaces GANs with a memory-based approach for unsupervised image captioning, improving efficiency and performance.

Findings

01. R^2M outperforms state-of-the-art methods on benchmark datasets.

02. The proposed model has fewer parameters and higher computational efficiency.

03. It effectively encodes visual context and learns from textual corpora without supervision.

Abstract

Unsupervised image captioning, which uses no annotations, is an emerging challenge in computer vision, where existing approaches usually adopt GAN (Generative Adversarial Network) models. In this paper, we propose a novel memory-based network rather than a GAN, named Recurrent Relational Memory Network ($R^{2}M$). Unlike complicated and sensitive adversarial learning, which performs poorly on long sentence generation, $R^{2}M$ implements a concepts-to-sentence memory translator through a two-stage memory mechanism, fusion memory and recurrent memory, which correlates the relational reasoning between common visual concepts and the generated words over long periods. $R^{2}M$ encodes visual context through unsupervised training on images, while enabling the memory to learn from an unrelated textual corpus in a supervised fashion. Our solution has fewer learnable parameters and higher computational efficiency than…
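The two-stage idea in the abstract (a fusion memory followed by a recurrent memory) can be illustrated with a toy NumPy sketch. This is not the authors' architecture: the attention form of the fusion step, the gated form of the recurrent update, the greedy decoding loop, and every name, shape, and random weight below are assumptions chosen only to make the data flow concrete.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fusion_memory(concepts, word, W_q, W_k, W_v):
    """Stage 1 (assumed form): the current word attends over concepts plus itself."""
    items = np.vstack([concepts, word[None, :]])      # (n_concepts + 1, d)
    q = word @ W_q                                     # query from the current word
    k, v = items @ W_k, items @ W_v
    weights = softmax(k @ q / np.sqrt(q.shape[0]))     # one weight per item
    return weights @ v                                 # fused vector, shape (d,)

def recurrent_memory(memory, fused, W_g, U_g):
    """Stage 2 (assumed form): gated update that carries state across time steps."""
    gate = 1.0 / (1.0 + np.exp(-(fused @ W_g + memory @ U_g)))
    return gate * np.tanh(fused) + (1.0 - gate) * memory

def decode(concepts, embed, params, steps, start_id=0):
    """Greedy concepts-to-sentence decoding with the two hypothetical stages."""
    W_q, W_k, W_v, W_g, U_g, W_out = params
    memory = np.zeros(embed.shape[1])
    word_id, token_ids = start_id, []
    for _ in range(steps):
        fused = fusion_memory(concepts, embed[word_id], W_q, W_k, W_v)
        memory = recurrent_memory(memory, fused, W_g, U_g)
        probs = softmax(memory @ W_out)                # distribution over the vocabulary
        word_id = int(np.argmax(probs))
        token_ids.append(word_id)
    return token_ids, probs

# Untrained random weights: the output tokens are meaningless, only the flow matters.
rng = np.random.default_rng(0)
d, vocab = 8, 20
embed = rng.normal(size=(vocab, d))
params = [rng.normal(scale=0.5, size=s) for s in [(d, d)] * 5 + [(d, vocab)]]
concepts = embed[[4, 7, 11]]                           # pretend these are detected visual concepts
token_ids, last_probs = decode(concepts, embed, params, steps=5)
```

In this sketch the visual concepts enter only through the fusion step, while the recurrent memory is the sole state carried between words, which is one simple way to read "correlating visual concepts and generated words over long periods".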

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Memory Network