Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, and Qingming Huang

arXiv:2602.01257·cs.CV·February 3, 2026

Open Access

TL;DR

This paper introduces TRA, a framework that augments point-supervised temporal action localization with textual descriptions of video frames. Two dedicated modules refine these descriptions and align them with the visual features, and the authors report improved localization accuracy.

Contribution

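The proposed TRA framework adds two modules to the original point-supervised pipeline: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). With them, textual features derived from frame descriptions complement the visual features, giving richer semantics and more precise action localization in videos.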

Findings

01

Outperforms state-of-the-art methods on five benchmarks.

02

Effectively reduces modality gap via contrastive learning.

03

Operates efficiently on a single high-end GPU.

Abstract

Recently, point-supervised temporal action localization has gained significant attention for its effective balance between labeling costs and localization accuracy. However, current methods only consider features from visual inputs, neglecting helpful semantic information from the text side. To address this issue, we propose a Text Refinement and Alignment (TRA) framework that effectively utilizes textual features from visual descriptions to complement the visual features as they are semantically rich. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA). Specifically, we first generate descriptions for video frames using a pre-trained multimodal model. Next, PTR refines the initial descriptions by leveraging point annotations together with multiple…
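The findings above mention contrastive learning as the way the modality gap between visual and textual features is reduced. As a rough illustration of that general idea only, the sketch below computes a generic symmetric InfoNCE-style loss over paired visual and text feature vectors. It is not the paper's PMA module: the function names, the temperature value of 0.07, and the toy features are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_logits(visual, text, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature (assumed value)."""
    v = [l2_normalize(row) for row in visual]
    t = [l2_normalize(row) for row in text]
    return [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
            for vi in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the row maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(visual, text, temperature=0.07):
    """Average of the visual-to-text and text-to-visual contrastive losses."""
    logits = cosine_logits(visual, text, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))


# Toy features (hypothetical): row i of each matrix describes the same segment.
visual = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_info_nce(visual, aligned_text)
loss_swapped = symmetric_info_nce(visual, swapped_text)
```

In this toy setup the loss is lower when each text feature points in roughly the same direction as its paired visual feature, which is the behaviour a contrastive alignment objective rewards. How TRA selects positive and negative pairs from point annotations is not stated in the text shown here, so the sketch simply pairs row i with row i.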

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation