Boosting Weakly-Supervised Temporal Action Localization with Text Information
Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Xiaoyu Wang, Xinbo Gao

TL;DR
This paper uses text descriptions built from action class labels to improve weakly-supervised temporal action localization (WTAL) through a discriminative objective and a generative objective, reporting state-of-the-art results on THUMOS14 and ActivityNet1.3.
Contribution
The paper proposes a Text-Segment Mining (TSM) mechanism and a Video-text Language Completion objective that make better use of text information in WTAL and improve localization accuracy.
Findings
Achieved state-of-the-art performance on THUMOS14 and ActivityNet1.3.
Method can be seamlessly integrated into existing WTAL models.
Significant performance improvements demonstrated across benchmarks.
Abstract
Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods are generally stuck in over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects: (a) a discriminative objective to enlarge the inter-class difference, thus reducing over-complete localization; (b) a generative objective to enhance the intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label and uses the text as a query to mine all class-related segments. Without temporal annotation of actions, TSM compares the text query with entire videos across the dataset to mine the best-matching segments while ignoring irrelevant ones. Due…
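The text-as-query idea in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the prompt template, the function names `build_text_query` and `mine_segments`, the `temperature` parameter, and the use of cosine similarity with softmax pooling are all assumptions chosen for illustration. In practice the text embedding would come from a text encoder and the segment features from a video backbone; here both are plain NumPy arrays.

```python
import numpy as np


def build_text_query(class_name: str) -> str:
    # Hypothetical prompt template; the paper's actual wording is not given here.
    return f"a video of {class_name}"


def mine_segments(text_emb, segment_feats, temperature=0.1, eps=1e-8):
    """Score every segment against a text query (illustrative sketch only).

    text_emb:      (D,) embedding of the text query
    segment_feats: (T, D) features of the T segments of one video
    Returns per-segment cosine similarity, softmax attention weights,
    and a video-level score pooled with those weights.
    """
    # L2-normalise so the dot product is a cosine similarity.
    t = text_emb / (np.linalg.norm(text_emb) + eps)
    s = segment_feats / (np.linalg.norm(segment_feats, axis=1, keepdims=True) + eps)
    sim = s @ t  # shape (T,)

    # Softmax over segments: well-matching segments get most of the weight,
    # irrelevant ones are suppressed.
    logits = sim / temperature
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()

    # The video-level score can be supervised with the video-level label alone,
    # which is the only annotation available in the weakly-supervised setting.
    video_score = float(weights @ sim)
    return sim, weights, video_score
```

Calling `mine_segments` with one query embedding per class yields a per-class score for a video; segments with high weight under the ground-truth class query are the candidates for that action.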
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Natural Language Processing Techniques
