Towards Diverse Temporal Grounding under Single Positive Labels
Hao Zhou, Chongyang Zhang, Yanjun Chen, Chuanping Hu

TL;DR
This paper introduces a novel framework for temporal grounding in videos that accounts for multiple moments described by a single query, using positive moment estimation and diverse regression to improve retrieval accuracy.
Contribution
The paper reformulates temporal grounding as a one-vs-many optimization problem under single positive labels and proposes the DTG-SPL framework, which mainly consists of a positive moment estimation (PME) module and a diverse moment regression (DMR) module.
Findings
DTG-SPL outperforms existing methods on the Charades-STA and ActivityNet Captions datasets.
It mines potential positive moments among the unlabeled ones, so that multiple relevant video segments can be retrieved for a single query.
It achieves superior results on both single-label and multi-label evaluation metrics.
Abstract
Temporal grounding aims to retrieve moments of the described event within an untrimmed video by a language query. Typically, existing methods assume annotations are precise and unique, yet one query may describe multiple moments in many cases. Hence, simply taking it as a one-vs-one mapping task and striving to match single-label annotations will inevitably introduce false negatives during optimization. In this study, we reformulate this task as a one-vs-many optimization problem under the condition of single positive labels. The unlabeled moments are considered unobserved rather than negative, and we explore mining potential positive moments to assist in multiple moment retrieval. In this setting, we propose a novel Diverse Temporal Grounding framework, termed DTG-SPL, which mainly consists of a positive moment estimation (PME) module and a diverse moment regression (DMR) module. PME…
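The labeling assumption in the abstract, that unlabeled moments are unobserved rather than negative, can be illustrated with a small sketch. This is not the paper's PME or DMR module or its actual loss. The function names `assume_negative_loss` and `single_positive_loss`, the `pseudo_positives` argument, and the use of per-candidate match probabilities are all assumptions made for illustration.

```python
import math

EPS = 1e-12  # guards against log(0)


def assume_negative_loss(scores, positive_index):
    """One-vs-one view: every unlabeled candidate moment is a negative.

    `scores` holds hypothetical match probabilities in [0, 1],
    one per candidate moment.
    """
    total = 0.0
    for i, p in enumerate(scores):
        if i == positive_index:
            total += -math.log(max(p, EPS))
        else:
            # A relevant but unlabeled moment is penalized here,
            # which is the false-negative problem the abstract describes.
            total += -math.log(max(1.0 - p, EPS))
    return total / len(scores)


def single_positive_loss(scores, positive_index, pseudo_positives=()):
    """Single-positive view: unlabeled candidates are unobserved and skipped.

    Only the annotated moment and any mined pseudo-positives
    (hypothetical indices) contribute to the loss.
    """
    observed = {positive_index, *pseudo_positives}
    total = sum(-math.log(max(scores[i], EPS)) for i in observed)
    return total / len(observed)


# Candidate 0 is annotated; candidate 1 is relevant but unlabeled.
scores = [0.9, 0.8, 0.1]
print(assume_negative_loss(scores, positive_index=0))
print(single_positive_loss(scores, positive_index=0))
print(single_positive_loss(scores, positive_index=0, pseudo_positives=[1]))
```

Skipping unobserved candidates removes the penalty on the high score at index 1, but on its own it supplies no negative signal at all, so this shows only the labeling assumption and is not a complete training objective.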
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
