Learning to Discretely Compose Reasoning Module Networks for Video Captioning
Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha

TL;DR
This paper introduces Reasoning Module Networks (RMN), a video captioning approach that dynamically composes discrete reasoning modules during generation, aiming to make the captioning process more explainable and to improve the quality of the generated descriptions.
Contribution
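The paper proposes a visual reasoning framework for video captioning that combines spatio-temporal reasoning modules with a discrete module selector, which composes the modules dynamically as the caption is generated. Existing visual reasoning methods were designed for visual question answering and do not cover this setting.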
Findings
RMN outperforms state-of-the-art methods on the MSVD and MSR-VTT datasets.
The approach provides an explicit, explainable reasoning process.
Extensive experiments validate the effectiveness of the proposed modules.
Abstract
Generating natural language descriptions for videos, i.e., video captioning, essentially requires step-by-step reasoning along the generation process. For example, to generate the sentence "a man is shooting a basketball", we need to first locate and describe the subject "man", next reason out the man is "shooting", then describe the object "basketball" of shooting. However, existing visual reasoning methods designed for visual question answering are not appropriate to video captioning, for it requires more complex visual reasoning on videos over both space and time, and dynamic module composition along the generation process. In this paper, we propose a novel visual reasoning approach for video captioning, named Reasoning Module Networks (RMN), to equip the existing encoder-decoder framework with the above reasoning capacity. Specifically, our RMN employs 1) three sophisticated…
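The step-by-step example in the abstract ("man", then "shooting", then "basketball") can be illustrated with a toy sketch of discrete module composition. Everything in it is an assumption made for illustration: the module names `describe_subject`, `describe_action`, and `describe_object` are hypothetical and are not the paper's modules, the "video" is a hand-written dictionary rather than visual features, and the selector scores are set by hand rather than learned. The optional Gumbel-max sampling is one standard way to draw a discrete choice from scores; the text above does not state how RMN's selector is implemented or trained.

```python
import math
import random

# Toy stand-in for video features (assumption: real inputs would be visual features).
VIDEO = {"subject": "man", "action": "shooting", "object": "basketball"}

# Hypothetical modules named after the steps in the abstract's example.
def describe_subject(video):
    return video["subject"]

def describe_action(video):
    return video["action"]

def describe_object(video):
    return video["object"]

MODULES = [describe_subject, describe_action, describe_object]

def gumbel_noise(rng):
    """Draw one sample from Gumbel(0, 1) by inverse transform sampling."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep both logs finite
    return -math.log(-math.log(u))

def select_module(scores, rng=None):
    """Pick exactly one module index (a hard, discrete choice).

    Without rng this is a plain argmax. With rng it applies the Gumbel-max
    trick, which samples an index with probability softmax(scores).
    """
    if rng is not None:
        scores = [s + gumbel_noise(rng) for s in scores]
    return max(range(len(scores)), key=lambda i: scores[i])

def compose(video, step_scores, rng=None):
    """Run one selected module per generation step and collect its output."""
    words = []
    for scores in step_scores:
        idx = select_module(scores, rng)
        words.append(MODULES[idx](video))
    return words

# Hand-set selector scores, one row per generation step (assumption:
# a trained selector would produce these from the decoder state).
STEP_SCORES = [
    [2.0, 0.1, 0.3],  # step 1 favors describe_subject
    [0.2, 1.5, 0.1],  # step 2 favors describe_action
    [0.0, 0.4, 3.0],  # step 3 favors describe_object
]

if __name__ == "__main__":
    print(compose(VIDEO, STEP_SCORES))  # ['man', 'shooting', 'basketball']
```

Because exactly one module runs at each step, the sequence of selected indices doubles as a readable trace of which kind of reasoning produced each word, which is the sense in which a discrete composition is explicit.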
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
