PR-DETR: Injecting Position and Relation Prior for Dense Video Captioning

Yizhe Li; Sanping Zhou; Zheng Qin; Le Wang

arXiv:2506.16082·cs.CV·June 23, 2025

Open Access

TL;DR

PR-DETR enhances dense video captioning by explicitly incorporating position and relation priors into a transformer framework, leading to improved localization accuracy and caption quality in untrimmed videos.

Contribution

The paper introduces position-anchored queries and an event relation encoder that model event locations and inter-event relationships explicitly, rather than leaving them to be learned implicitly as in prior detection-transformer approaches.

Findings

1. Achieves superior localization and captioning performance on ActivityNet Captions.

2. Demonstrates the effectiveness of position and relation priors through ablation studies.

3. Outperforms existing methods on benchmark datasets.

Abstract

Dense video captioning is a challenging task that aims to localize and caption multiple events in an untrimmed video. Recent studies mainly follow the transformer-based architecture to jointly perform the two sub-tasks, i.e., event localization and caption generation, in an end-to-end manner. Based on the general philosophy of detection transformer, these methods implicitly learn the event locations and event semantics, which requires a large amount of training data and limits the model's performance in practice. In this paper, we propose a novel dense video captioning framework, named PR-DETR, which injects the explicit position and relation prior into the detection transformer to improve the localization accuracy and caption quality, simultaneously. On the one hand, we first generate a set of position-anchored queries to provide the scene-specific position and semantic information…
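To make the localization sub-task concrete, here is a minimal sketch of temporal intersection-over-union (tIoU), a measure commonly used to compare a predicted event segment against a ground-truth one. The excerpt above does not state which metric PR-DETR reports, so the metric choice, the `Segment` type, and the function names `temporal_iou` and `recall_at_tiou` are illustrative assumptions, not the authors' evaluation code.

```python
from typing import List, Tuple

# Assumed representation: an event is a (start_sec, end_sec) pair.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Overlap of two time segments divided by the span they jointly cover."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_tiou(
    predicted: List[Segment], ground_truth: List[Segment], threshold: float
) -> float:
    """Fraction of ground-truth events matched by at least one prediction
    whose tIoU meets the threshold (hypothetical helper for illustration)."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for gt in ground_truth
        if any(temporal_iou(p, gt) >= threshold for p in predicted)
    )
    return hits / len(ground_truth)


if __name__ == "__main__":
    preds = [(0.0, 10.0), (12.0, 20.0)]
    gts = [(0.0, 9.0), (30.0, 40.0)]
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))  # 5 s overlap over a 15 s union
    print(recall_at_tiou(preds, gts, threshold=0.5))  # one of two events matched
```

In this toy example the first ground-truth event is covered by a prediction with tIoU 0.9, while the second has no overlapping prediction, so recall at a 0.5 threshold is 0.5.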

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques