Exploring the Temporal Consistency for Point-Level Weakly-Supervised Temporal Action Localization

Yunchuan Ma, Laiyun Qing, Guorong Li, Yuqing Liu, Yuankai Qi, and Qingming Huang

arXiv:2602.05718 · cs.CV · February 6, 2026

Open Access

TL;DR

This paper introduces a multi-task learning framework that leverages point supervision and self-supervised tasks to enhance temporal understanding for more accurate point-level weakly-supervised temporal action localization.

Contribution

It is the first to explicitly explore temporal consistency using self-supervised tasks in point-supervised action localization, improving model understanding of temporal relationships.

Findings

01. Outperforms state-of-the-art methods on four benchmarks.

02. Self-supervised tasks improve temporal understanding and localization accuracy.

03. Demonstrates the importance of modeling temporal relationships in weak supervision.

Abstract

Point-supervised Temporal Action Localization (PTAL) adopts a lightly annotated paradigm (i.e., labeling only a single frame per action instance) to train a model to locate action instances within untrimmed videos. Most existing approaches design the model's task head with only point-supervised snippet-level classification, without explicitly modeling the temporal relationships among the frames of an action. However, understanding these temporal relationships is crucial: it helps a model understand how an action is defined and therefore helps it localize the full extent of an action. To this end, we design a multi-task learning framework that fully utilizes point supervision to boost the model's temporal understanding capability for action localization. Specifically, we design three self-supervised temporal…
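As a rough illustration of the point-supervised snippet-level classification that the abstract attributes to existing approaches, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a video has already been split into snippets with per-class scores, and that each action instance contributes exactly one labeled snippet. The names `point_supervised_loss`, `snippet_logits`, and `point_labels` are hypothetical and chosen only for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one snippet's class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def point_supervised_loss(snippet_logits, point_labels):
    """Mean cross-entropy over the annotated snippets only.

    snippet_logits: list of per-snippet class-score lists (T snippets x C classes).
    point_labels:   dict mapping a snippet index to its class index,
                    one entry per annotated action instance (assumed format).
    Snippets without a point label contribute nothing to the loss.
    """
    if not point_labels:
        return 0.0
    losses = []
    for t, cls in point_labels.items():
        probs = softmax(snippet_logits[t])
        losses.append(-math.log(probs[cls]))
    return sum(losses) / len(losses)


# Example: four snippets, two classes, a single labeled point at snippet 2.
logits = [[0.1, 0.3], [2.0, -1.0], [0.0, 0.0], [1.5, 0.5]]
loss = point_supervised_loss(logits, {2: 0})
```

Because only the labeled snippet enters the loss, the scores of every other snippet are unconstrained by this objective. That is one way to read the gap the abstract describes: classification at isolated points says nothing about how neighboring frames relate to each other in time.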

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis