End-to-End Dense Video Grounding via Parallel Regression

Fengyuan Shi; Weilin Huang; Limin Wang

arXiv:2109.11265 · cs.CV · February 29, 2024 · 5 cites

Open Access

TL;DR

This paper introduces PRVG, an end-to-end parallel regression model for dense video grounding built on a Transformer-like architecture. Given a paragraph as input, it localizes multiple moments in an untrimmed video efficiently and accurately.

Contribution

The paper proposes a novel parallel regression framework for dense video grounding that simplifies the process and improves performance over existing proposal-based methods.

Findings

01. PRVG significantly outperforms previous methods on the ActivityNet Captions and TACoS benchmarks.

02. The proposed approach enables efficient inference without post-processing.

03. The parallel regression paradigm effectively handles dense video grounding tasks.

Abstract

Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query. Existing methods often address this task in an indirect way, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which could result in ambiguous localization due to its unclear description. In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input. From a perspective on video grounding as language conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-alike architecture (PRVG). The key…
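To make "language conditioned regression" with parallel decoding concrete, here is a minimal sketch of how one regressed output per sentence could be turned into a moment. It is not the authors' implementation. The function name `decode_moments` and the normalized (center, length) parametrization are assumptions made for illustration only; the abstract does not specify how PRVG represents a moment.

```python
from typing import List, Tuple


def decode_moments(
    preds: List[Tuple[float, float]], duration: float
) -> List[Tuple[float, float]]:
    """Convert one (center, length) pair per sentence into (start, end) seconds.

    Assumption (not from the paper): each pair is normalized to [0, 1]
    relative to the video duration.
    """
    moments = []
    for center, length in preds:
        # Clamp to the video boundaries, then scale to seconds.
        start = max(0.0, center - length / 2.0) * duration
        end = min(1.0, center + length / 2.0) * duration
        moments.append((start, end))
    # One moment per sentence: there are no candidate proposals to rank
    # and no near-duplicate results to remove afterwards.
    return moments


# Hypothetical predictions for a two-sentence paragraph on a 100-second video.
paragraph_preds = [(0.25, 0.5), (0.75, 0.5)]
print(decode_moments(paragraph_preds, duration=100.0))
```

Because each sentence maps to exactly one regressed moment, the output length always equals the number of sentences in the paragraph. That is the property that lets a regression-style model skip the duplicate-removal step that proposal-based pipelines need.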

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning