End-to-End Dense Video Captioning with Parallel Decoding

Teng Wang; Ruimao Zhang; Zhichao Lu; Feng Zheng; Ran Cheng; Ping Luo

arXiv:2108.07781·cs.CV·November 18, 2021

TL;DR

This paper introduces PDVC, a simple end-to-end dense video captioning framework that formulates caption generation as a set prediction task, improving the coherence and readability of the predicted captions without relying on complex hand-crafted components.

Contribution

The paper proposes a novel parallel decoding framework for dense video captioning that directly predicts event sets, enhancing efficiency and caption quality over prior methods.

Findings

01 Outperforms state-of-the-art two-stage methods on ActivityNet Captions and YouCook2 datasets.

02 Effectively segments videos into event pieces with a new event counter module.

03 Produces high-quality, coherent captions without heuristic post-processing.

Abstract

Dense video captioning aims to generate multiple associated captions with their temporal locations from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), which formulates dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) Without relying on heuristic non-maximum suppression or a recurrent event…
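The role of the event counter described in the abstract can be illustrated with a minimal sketch. It assumes that each decoder query yields one candidate event (a segment, a caption, and a confidence score) and that the counter outputs one logit per possible event count; the final set is then the N most confident candidates, with no non-maximum suppression. The names `EventProposal`, `select_events`, and `count_logits` are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EventProposal:
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    caption: str       # caption decoded for this query
    confidence: float  # per-query score (hypothetical 0..1 scale)


def select_events(proposals: List[EventProposal],
                  count_logits: List[float]) -> List[EventProposal]:
    """Keep the top-N proposals, where N is the counter's most likely count."""
    # Assumption: index k of count_logits scores the hypothesis "k events".
    predicted_count = max(range(len(count_logits)), key=lambda k: count_logits[k])
    # Rank all candidates by confidence and keep the N best; no NMS is applied.
    ranked = sorted(proposals, key=lambda p: p.confidence, reverse=True)
    chosen = ranked[:predicted_count]
    # Return the survivors in temporal order so the captions read in sequence.
    return sorted(chosen, key=lambda p: p.start)


if __name__ == "__main__":
    # Toy candidates standing in for the outputs of four decoder queries.
    candidates = [
        EventProposal(0.0, 10.0, "a man chops vegetables", 0.9),
        EventProposal(30.0, 40.0, "he plates the dish", 0.8),
        EventProposal(12.0, 20.0, "he stirs a pot", 0.2),
        EventProposal(5.0, 8.0, "a knife is shown", 0.1),
    ]
    # Toy counter output: the largest logit sits at index 2, i.e. two events.
    for event in select_events(candidates, [0.1, 0.3, 2.0, 0.5]):
        print(f"[{event.start:.0f}s-{event.end:.0f}s] {event.caption}")
```

Because the number of kept events comes from a count predicted for the whole video rather than from a fixed score threshold, the size of the output set can adapt to each video, which is the behavior the abstract attributes to the event counter.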
