TL;DR
This paper introduces a simple end-to-end dense video captioning framework called PDVC that formulates captioning as a set prediction task, improving coherence and readability without relying on complex hand-crafted components.
Contribution
The paper proposes a novel parallel decoding framework for dense video captioning that directly predicts event sets, enhancing efficiency and caption quality over prior methods.
Findings
PDVC outperforms state-of-the-art two-stage methods on the ActivityNet Captions and YouCook2 datasets.
It segments a video into event pieces using a new event counter module stacked on top of a transformer decoder.
It produces coherent, readable captions without heuristic post-processing such as non-maximum suppression.
Abstract
Dense video captioning aims to generate multiple associated captions, together with their temporal locations, from a video. Previous methods follow a sophisticated "localize-then-describe" scheme, which relies heavily on numerous hand-crafted components. In this paper, we propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC), formulating dense caption generation as a set prediction task. In practice, by stacking a newly proposed event counter on top of a transformer decoder, PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content, which effectively increases the coherence and readability of the predicted captions. Compared with prior art, PDVC has several appealing advantages: (1) Without relying on heuristic non-maximum suppression or a recurrent event…
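The role of the event counter described in the abstract can be illustrated with a minimal sketch. It assumes, as a simplification that the abstract does not spell out, that the decoder emits a fixed set of candidate events in parallel, each with a confidence score, and that the counter's predicted count decides how many of them to keep. The names `Event`, `select_events`, and `predicted_count`, as well as the sample data, are hypothetical and not taken from the paper's code.

```python
from typing import List, Tuple

# Hypothetical candidate format: (start_sec, end_sec, confidence, caption).
Event = Tuple[float, float, float, str]


def select_events(candidates: List[Event], predicted_count: int) -> List[Event]:
    """Keep the `predicted_count` most confident candidates, ordered by start time.

    Assumption: the count comes from an event counter; here it is just an int.
    No non-maximum suppression or overlap heuristic is applied.
    """
    # Clamp the count so it never exceeds the number of candidates or drops below 0.
    k = max(0, min(predicted_count, len(candidates)))
    # Rank by confidence (index 2), highest first, and keep the top k.
    top_k = sorted(candidates, key=lambda e: e[2], reverse=True)[:k]
    # Return the kept events in temporal order for readability.
    return sorted(top_k, key=lambda e: e[0])


# Made-up candidates standing in for parallel decoder outputs.
candidates = [
    (0.0, 12.5, 0.91, "a man chops onions"),
    (10.0, 30.0, 0.35, "a man is in a kitchen"),
    (12.5, 40.0, 0.88, "he fries the onions in a pan"),
    (38.0, 55.0, 0.20, "someone cooks"),
]

selected = select_events(candidates, predicted_count=2)
for start, end, score, caption in selected:
    print(f"[{start:5.1f}s - {end:5.1f}s] ({score:.2f}) {caption}")
```

Because the number of kept events comes from a single predicted count rather than from a score threshold plus overlap suppression, the selection step has no hand-tuned parameters in this sketch.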
