Reconstruct and Represent Video Contents for Captioning via Reinforcement Learning

Wei Zhang; Bairui Wang; Lin Ma; Wei Liu

arXiv:1906.01452·cs.CV·June 5, 2019·5 cites

Open Access

TL;DR

This paper introduces RecNet, a novel encoder-decoder-reconstructor architecture for video captioning that leverages bidirectional flows and reinforcement learning to improve the quality of generated descriptions.

Contribution

The paper proposes a reconstruction network with bidirectional flows and fusion of local and global video features, enhancing video captioning performance.

Findings

1. Reconstruction network improves captioning accuracy.
2. Bidirectional flow modeling benefits description quality.
3. Reinforcement learning further boosts performance.

Abstract

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) in a novel encoder-decoder-reconstructor architecture, which leverages both forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder component makes use of the forward flow to produce a sentence description based on the encoded video semantic features. Two types of reconstructors are subsequently proposed to employ the backward flow and reproduce the video features from local and global perspectives, respectively, capitalizing on the hidden state sequence generated by the decoder. Moreover, in order to make a comprehensive reconstruction of the video…
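
To make the forward and backward flows in the abstract concrete, here is a minimal sketch of how a captioning objective might be combined with a reconstruction penalty. It is an illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the weighted-sum form, the weight `lam`, and all function names are hypothetical choices that the abstract does not specify. The "global" variant compares one reconstructed vector against the mean-pooled frame features, while the "local" variant compares reconstructed features frame by frame.

```python
import math


def mean_pool(vectors):
    """Average a list of equal-length feature vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def global_reconstruction_loss(frame_feats, reconstructed_global):
    """Assumed global view: distance between the mean-pooled frame
    features and a single reconstructed vector."""
    return euclidean(mean_pool(frame_feats), reconstructed_global)


def local_reconstruction_loss(frame_feats, reconstructed_frames):
    """Assumed local view: average per-frame distance between the original
    and the reconstructed frame features."""
    dists = [euclidean(f, r) for f, r in zip(frame_feats, reconstructed_frames)]
    return sum(dists) / len(dists)


def caption_nll(word_probs):
    """Forward flow: negative log-likelihood of the reference caption,
    given the probability the decoder assigned to each reference word."""
    return -sum(math.log(p) for p in word_probs)


def total_loss(word_probs, rec_loss, lam=0.2):
    """Hypothetical joint objective: caption loss plus a weighted
    reconstruction loss. The value of lam is an arbitrary placeholder."""
    return caption_nll(word_probs) + lam * rec_loss
```

In this sketch the reconstructed vectors are passed in directly; in an encoder-decoder-reconstructor model they would be produced from the decoder's hidden state sequence, which is the part the abstract attributes to the backward flow.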

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization