PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning
Yizhe Li, Sanping Zhou, Zheng Qin, Le Wang

TL;DR
PR-DETR enhances dense video captioning by explicitly incorporating position and relation priors into a transformer framework, leading to improved localization accuracy and caption quality in untrimmed videos.
Contribution
PR-DETR introduces position-anchored queries and an event relation encoder to explicitly model event locations and the relationships between events, in contrast to prior methods that learn them implicitly.
Findings
Achieves superior localization and captioning performance on ActivityNet Captions.
Demonstrates the effectiveness of position and relation priors through ablation studies.
Outperforms existing methods on benchmark datasets.
Abstract
Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model's performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information…
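The abstract is cut off before it explains how the relation prior is computed, so the sketch below is only an illustration and not the authors' method. It assumes that a pairwise relation between candidate events can be summarized by temporal IoU (overlap divided by union of two time segments), a standard measure in event localization. The names `Segment`, `temporal_iou`, and `pairwise_relation` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: an event is a (start, end) pair in seconds.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Temporal intersection-over-union of two segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def pairwise_relation(segments: List[Segment]) -> List[List[float]]:
    """Build an N x N matrix of temporal IoU values between all events.

    Assumption: a matrix like this could serve as an explicit relation
    prior between events; the paper's actual encoding may differ.
    """
    return [[temporal_iou(a, b) for b in segments] for a in segments]


# Three candidate events: the first two overlap, the third is disjoint.
segments = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
relation = pairwise_relation(segments)
```

In this toy input, the first two events overlap for 5 of their combined 15 seconds, so their entry is 1/3, while the third event shares no time with the others and gets 0.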
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
