End-to-End Dense Video Grounding via Parallel Regression
Fengyuan Shi, Weilin Huang, Limin Wang

TL;DR
This paper introduces PRVG, an end-to-end parallel regression model built on a Transformer-like architecture for dense video grounding: given a paragraph as input, it localizes multiple moments in an untrimmed video simultaneously.
Contribution
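The paper proposes a parallel regression framework for dense video grounding that treats grounding as language-conditioned regression. This avoids the label assignment and near-duplicate removal that proposal-based methods rely on, and the paper reports improved performance over those methods.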
Findings
PRVG significantly outperforms previous methods on ActivityNet Captions and TACoS benchmarks.
The proposed approach enables efficient inference without post-processing.
The parallel regression paradigm effectively handles dense video grounding tasks.
Abstract
Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query. Existing methods often address this task in an indirect way, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which could result in ambiguous localization due to its unclear description. In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input. From a perspective on video grounding as language conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-alike architecture (PRVG). The key…
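To make the regression view concrete, here is a minimal sketch of how one prediction per sentence could be turned into moments and scored. It is not PRVG's implementation: the normalized (center, length) parameterization, the function names `decode_moments` and `temporal_iou`, and all numbers are assumptions chosen for illustration. The point it shows is that when each sentence in the paragraph yields exactly one regressed moment, no proposal ranking or duplicate removal step is needed.

```python
from typing import List, Tuple

Moment = Tuple[float, float]  # (start, end) in seconds


def decode_moments(preds: List[Tuple[float, float]], duration: float) -> List[Moment]:
    """Map normalized (center, length) pairs, one per sentence, to (start, end) seconds.

    The (center, length) encoding is an assumption made for this sketch.
    """
    moments = []
    for center, length in preds:
        start = max(0.0, (center - length / 2.0) * duration)
        end = min(duration, (center + length / 2.0) * duration)
        moments.append((start, end))
    return moments


def temporal_iou(a: Moment, b: Moment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical model outputs for a two-sentence paragraph and a 100-second video.
preds = [(0.25, 0.5), (0.75, 0.5)]
duration = 100.0
moments = decode_moments(preds, duration)

# Hypothetical ground-truth moments, one per sentence, in the same order.
ground_truth = [(0.0, 40.0), (50.0, 100.0)]
ious = [temporal_iou(m, g) for m, g in zip(moments, ground_truth)]
print(moments, ious)
```

Because predictions and sentences are paired by position, evaluation reduces to a per-sentence temporal IoU, with no matching or suppression step in between.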
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
