End-to-End Dense Video Captioning with Masked Transformer

Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, and Caiming Xiong

arXiv:1804.00819 · cs.CV · April 4, 2018 · 42 cites

TL;DR

This paper introduces an end-to-end transformer model for dense video captioning that integrates event proposal and captioning tasks, improving accuracy and efficiency by using a differentiable masking mechanism and self-attention.

Contribution

The proposed model is the first to unify event proposal and captioning in an end-to-end transformer framework with a differentiable mask for better consistency.

Findings

01

Achieved a METEOR score of 10.12 on ActivityNet Captions.

02

Achieved a METEOR score of 6.58 on YouCookII.

03

Demonstrated improved performance over previous methods.

Abstract

Dense video captioning aims to generate text descriptions for all events in an untrimmed video. This involves both detecting and describing events. Therefore, all previous methods on dense video captioning tackle this problem by building two models, i.e. an event proposal and a captioning model, for these two sub-problems. The models are either trained separately or in alternation. This prevents direct influence of the language description to the event proposal, which is important for generating accurate descriptions. To address this problem, we propose an end-to-end transformer model for dense video captioning. The encoder encodes the video into appropriate representations. The proposal decoder decodes from the encoding with different anchors to form video event proposals. The captioning decoder employs a masking network to restrict its attention to the proposal event over the encoding…
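The masking idea in the abstract can be illustrated with a small sketch. The abstract here is truncated and does not give the mask's actual form, so everything below is an assumption made for illustration: the product-of-sigmoids mask, the `sharpness` parameter, and the function names `soft_mask` and `mask_features` are hypothetical and are not taken from the paper or from the salesforce/densecap repository. The sketch only shows why a smooth mask lets gradients reach the proposal boundaries, which a hard 0/1 window would not.

```python
import math


def soft_mask(num_steps, start, end, sharpness=1.0):
    """Return one weight per time step, close to 1 inside [start, end].

    Assumed form (not from the paper): a product of two sigmoids, which is
    smooth in `start` and `end`, so a loss computed on the masked features
    has non-zero gradients with respect to the proposal boundaries.
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    return [
        sigmoid(sharpness * (t - start)) * sigmoid(sharpness * (end - t))
        for t in range(num_steps)
    ]


def mask_features(features, mask):
    """Scale each per-step feature vector by its mask weight."""
    return [[m * x for x in vec] for vec, m in zip(features, mask)]


# Toy encoder output: 10 time steps, 2 features per step.
features = [[1.0, 2.0] for _ in range(10)]

# Hypothetical proposal covering roughly steps 3 to 6.
mask = soft_mask(10, start=3, end=6, sharpness=4.0)
masked = mask_features(features, mask)
```

In this toy setup the steps inside the proposal keep almost all of their magnitude, while steps far outside are scaled towards zero, so a captioning decoder attending over `masked` would mostly see the proposed event.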

Code & Models

Repositories

salesforce/densecap
pytorch

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax