Contrast-Unity for Partially-Supervised Temporal Sentence Grounding

Haicheng Wang; Chen Ju; Weixiong Lin; Chaofan Ma; Shuai Xiao; Ya Zhang; Yanfeng Wang

arXiv:2502.12917·cs.CV·February 19, 2025

TL;DR

This paper introduces a contrast-unity framework for partially-supervised temporal sentence grounding, leveraging limited clip annotations to improve event detection in videos through a two-stage implicit-explicit grounding process.

Contribution

It proposes a novel contrast-unity approach that effectively utilizes partial labels with a two-stage implicit-explicit training strategy for temporal grounding.

Findings

1. Achieves superior performance on the Charades-STA and ActivityNet Captions datasets.

2. Demonstrates the advantage of partial supervision over weakly-supervised methods.

3. Validates the importance of contrastive learning for fine-grained event-query alignment.

Abstract

Temporal sentence grounding aims to detect the timestamps of events described by a natural language query in untrimmed videos. The existing fully-supervised setting achieves strong results but incurs expensive annotation costs, while the weakly-supervised setting uses cheap labels but performs poorly. To pursue high performance at lower annotation cost, this paper introduces an intermediate partially-supervised setting, i.e., only short-clip annotations are available during training. To make full use of partial labels, we design a contrast-unity framework with the two-stage goal of implicit-explicit progressive grounding. In the implicit stage, we align event-query representations at fine granularity using comprehensive quadruple contrastive learning: event-query gathering, event-background separation, intra-cluster compactness, and inter-cluster separability. Then, high-quality…
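As a rough illustration of the first two contrastive goals named in the abstract (pulling a query toward its event and pushing it away from background), the following is a minimal sketch using a generic InfoNCE-style loss. It is not the paper's formulation, which the excerpt above does not give; the function names, the toy 2-D features, and the temperature value are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical stand-in, not the paper's loss).

    Low when the anchor is close to the positive and far from every
    negative; high otherwise.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

# Toy features (assumed): one query, one labeled event clip, two background clips.
query = [1.0, 0.0]
event = [0.9, 0.1]
background = [[0.0, 1.0], [-1.0, 0.2]]

# Query paired with its event, background clips as negatives.
aligned = info_nce(query, event, background)
# Query wrongly paired with a background clip, the event as a negative.
misaligned = info_nce(query, background[0], [event, background[1]])

print(aligned < misaligned)
```

Under these toy features the correctly paired loss is smaller than the mispaired one, which is the behavior an event-query gathering and event-background separation objective is meant to encourage. The intra-cluster and inter-cluster terms would need cluster assignments that the excerpt does not describe, so they are left out of this sketch.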

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: ALIGN