Reconstruction Network for Video Captioning

Bairui Wang; Lin Ma; Wei Zhang; Wei Liu

arXiv:1803.11438 · cs.CV · April 2, 2018 · 41 cites

TL;DR

This paper introduces RecNet, an encoder-decoder-reconstructor architecture for video captioning. It uses both the forward (video to sentence) and backward (sentence to video) flows, and it is trained jointly with a generation loss and a reconstruction loss to improve caption accuracy.

Contribution

The paper proposes a reconstruction network that adds a backward (sentence to video) flow to the usual encoder-decoder: a reconstructor reproduces the video features from the decoder's hidden states, so caption generation and video feature reconstruction are optimized jointly.

Findings

01. Significant improvement in caption accuracy on benchmark datasets.

02. Reconstruction loss boosts encoder-decoder model performance.

03. Bidirectional flow approach effectively leverages video and sentence information.

Abstract

In this paper, the problem of describing visual contents of a video sequence with natural language is addressed. Unlike previous video captioning work mainly exploiting the cues of video contents to make a language description, we propose a reconstruction network (RecNet) with a novel encoder-decoder-reconstructor architecture, which leverages both the forward (video to sentence) and backward (sentence to video) flows for video captioning. Specifically, the encoder-decoder makes use of the forward flow to produce the sentence description based on the encoded video semantic features. Two types of reconstructors are customized to employ the backward flow and reproduce the video features based on the hidden state sequence generated by the decoder. The generation loss yielded by the encoder-decoder and the reconstruction loss introduced by the reconstructor are jointly drawn into training…
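
The joint training described in the abstract can be illustrated with a minimal sketch. It assumes the two losses are combined as a weighted sum, that the generation loss is the negative log-likelihood of the reference caption, and that the reconstruction loss is a Euclidean distance between mean-pooled video features and features reproduced from the decoder's hidden states. The function names, the trade-off weight `lam`, and the plain linear map standing in for the learned reconstructor are all hypothetical and are not taken from the paper.

```python
import math


def generation_loss(step_probs, target_ids):
    """Negative log-likelihood of the reference caption.

    step_probs[t] is the decoder's probability distribution over the
    vocabulary at step t; target_ids[t] is the reference word index.
    """
    return -sum(math.log(probs[w]) for probs, w in zip(step_probs, target_ids))


def mean_pool(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear_map(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def reconstruction_loss(reconstructed, original):
    """Euclidean distance between reconstructed and original features."""
    return math.sqrt(sum((r - o) ** 2 for r, o in zip(reconstructed, original)))


def joint_loss(step_probs, target_ids, hidden_states, frame_features,
               weights, lam=0.5):
    """Generation loss plus lam times reconstruction loss.

    Assumption: a linear map over mean-pooled decoder hidden states stands
    in for the reconstructor, and lam=0.5 is an arbitrary illustrative value.
    """
    gen = generation_loss(step_probs, target_ids)
    reconstructed = linear_map(weights, mean_pool(hidden_states))
    rec = reconstruction_loss(reconstructed, mean_pool(frame_features))
    return gen + lam * rec


if __name__ == "__main__":
    # Toy inputs: 2 decoding steps, vocabulary of 2 words,
    # 2-dimensional hidden states, 3-dimensional frame features.
    probs = [[0.5, 0.5], [0.25, 0.75]]
    targets = [0, 1]
    hidden = [[1.0, 3.0], [3.0, 5.0]]
    frames = [[2.0, 4.0, 3.0], [2.0, 4.0, 1.0]]
    w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(joint_loss(probs, targets, hidden, frames, w))
```

In this sketch the backward flow only contributes a penalty term: the further the reproduced features drift from the original video features, the larger the total loss, which pushes the decoder's hidden states to retain information about the video.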

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization