Learning to Discretely Compose Reasoning Module Networks for Video Captioning

Ganchao Tan; Daqing Liu; Meng Wang; Zheng-Jun Zha

arXiv:2007.09049 · cs.CV · July 20, 2020 · 6 cites

TL;DR

This paper introduces Reasoning Module Networks (RMN), a video captioning approach that dynamically and discretely composes reasoning modules during generation, making the captioning process more explainable and improving the quality of the generated descriptions.

Contribution

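The paper proposes a visual reasoning framework for video captioning that combines spatio-temporal reasoning modules with a discrete module selector. Unlike visual reasoning methods designed for visual question answering, it reasons over both space and time and composes modules dynamically as the caption is generated.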

Findings

01 RMN outperforms state-of-the-art methods on MSVD and MSR-VTT datasets.

02 The approach provides an explicit, explainable reasoning process.

03 Extensive experiments validate the effectiveness of the proposed modules.

Abstract

Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we need to first locate and describe the subject "man", next reason out the man is "shooting", then describe the object "basketball" of shooting. However, existing visual reasoning methods designed for visual question answering are not appropriate to video captioning, for it requires more complex visual reasoning on videos over both space and time, and dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated…
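The abstract describes selecting reasoning modules dynamically and discretely as each word is generated. The sketch below illustrates only that idea, under stated assumptions: it uses Gumbel-max sampling as the discrete choice mechanism (a standard technique, not something this page confirms the paper uses), the module names are hypothetical placeholders because the abstract is truncated before naming them, and the selector scores are made up for illustration.

```python
import math
import random

# Placeholder names: the abstract mentions three modules but this page
# truncates before naming them, so these labels are hypothetical.
MODULE_NAMES = ["module_a", "module_b", "module_c"]


def gumbel_max_choice(scores, rng):
    """Return one index, chosen as argmax(score + Gumbel(0, 1) noise)."""
    noisy = []
    for score in scores:
        u = max(rng.random(), 1e-12)  # keep log() finite
        noisy.append(score - math.log(-math.log(u)))
    return max(range(len(noisy)), key=noisy.__getitem__)


def select_modules(step_scores, seed=0):
    """Pick exactly one module per decoding step from selector scores."""
    rng = random.Random(seed)
    return [MODULE_NAMES[gumbel_max_choice(s, rng)] for s in step_scores]


# Hypothetical selector scores for three decoding steps (assumed values).
scores = [[100.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 0.0, 100.0]]
print(select_modules(scores))
```

With scores this strongly peaked, the noise cannot change the winner, so each step picks a different module. With closer scores the choice becomes stochastic, which is what lets a selector explore different module compositions.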

Code & Models

Repositories

tgc1997/RMN · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization