Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment
Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, and Qingming Huang

TL;DR
This paper introduces the Text Refinement and Alignment (TRA) framework, which improves point-supervised temporal action localization by generating textual descriptions of video frames, refining them with point annotations, and aligning them with visual features.
Contribution
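The TRA framework adds two modules to the original point-supervised pipeline, a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA), so that semantically rich textual features complement visual features and support more precise action localization in videos.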
Findings
Outperforms state-of-the-art methods on five benchmarks.
Reduces the gap between the visual and textual modalities via contrastive learning.
Operates efficiently on a single high-end GPU.
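The contrastive alignment mentioned in the findings can be illustrated with a minimal sketch. This summary does not give the form of the loss the paper uses, so the symmetric InfoNCE objective below is an assumption, and the names `l2_normalize`, `info_nce`, and `temperature` are hypothetical. It shows only the general idea: paired visual and text features are pulled together while mismatched pairs are pushed apart.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(visual, text, temperature=0.07):
    """Symmetric InfoNCE loss; visual[i] and text[i] are assumed to be a matching pair.

    Generic illustration only, not the loss defined in the paper.
    """
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in text]
    n = len(v)
    # Temperature-scaled cosine similarity between every visual/text pair.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum_exp - row[i]
        return total / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))
```

With toy features, correctly paired inputs such as `info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])` yield a much lower loss than the same text features in swapped order, which is the signal that drives the two modalities toward a shared space.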
Abstract
Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that uses semantically rich textual features derived from visual descriptions to complement the visual features. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation
