Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization

Hayat Ullah; Arslan Munir; Oliver Nina

arXiv:2507.06411·cs.CV·July 21, 2025

Open Access

TL;DR

This paper introduces PCL-Former, a hierarchical multi-stage transformer architecture for temporal action localization, which effectively identifies, classifies, and precisely localizes actions in untrimmed videos, outperforming existing methods.

Contribution

The paper proposes a novel hierarchical transformer architecture with dedicated modules for proposal, classification, and localization, advancing the state-of-the-art in temporal action localization.

Findings

01

Outperforms state-of-the-art on THUMOS-14, ActivityNet-1.3, and HACS datasets.

02

Each module's impact validated through ablation studies.

03

Improves on prior state-of-the-art results by 2.8%, 1.2%, and 4.8% on THUMOS-14, ActivityNet-1.3, and HACS, respectively.

Abstract

Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, in which each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive…
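The division of labor described in the abstract can be illustrated with a small data-flow sketch. This is not the authors' implementation: the function names, the `Detection` type, and the threshold-based stand-ins for the three transformer modules are all hypothetical, and the assumption that the localization stage refines each proposed segment is ours rather than something the abstract states. It only shows how a propose, classify, localize pipeline might be wired together over a toy "video" of per-frame activity scores.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A candidate segment as (start_frame, end_frame), end exclusive.
Segment = Tuple[int, int]


@dataclass
class Detection:
    label: str
    start: int
    end: int


def localize_actions(
    video: Sequence[float],
    propose: Callable[[Sequence[float]], List[Segment]],
    classify: Callable[[Sequence[float], Segment], str],
    refine: Callable[[Sequence[float], Segment], Segment],
) -> List[Detection]:
    """Run the three stages in order: propose, classify, then refine boundaries."""
    detections = []
    for segment in propose(video):
        label = classify(video, segment)      # stand-in for Classification-Former
        start, end = refine(video, segment)   # stand-in for Localization-Former
        detections.append(Detection(label, start, end))
    return detections


# Toy stand-ins (hypothetical): simple thresholds instead of transformer modules.
def toy_propose(video: Sequence[float], window: int = 4) -> List[Segment]:
    """Keep fixed-length windows that contain at least one active frame."""
    return [
        (i, min(i + window, len(video)))
        for i in range(0, len(video), window)
        if max(video[i:i + window]) > 0.5
    ]


def toy_classify(video: Sequence[float], segment: Segment) -> str:
    """Label a segment by its peak score."""
    start, end = segment
    return "high" if max(video[start:end]) > 0.8 else "low"


def toy_refine(video: Sequence[float], segment: Segment) -> Segment:
    """Shrink a coarse window to the span of its active frames."""
    start, end = segment
    active = [i for i in range(start, end) if video[i] > 0.5]
    return (active[0], active[-1] + 1)


if __name__ == "__main__":
    scores = [0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.0, 0.6, 0.7, 0.0]
    for det in localize_actions(scores, toy_propose, toy_classify, toy_refine):
        print(det)
```

Passing the three stages in as callables keeps each one independently replaceable, which mirrors the abstract's point that every subtask has its own dedicated module; in a real system each callable would wrap a trained model with its own loss.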

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Emotion and Mood Recognition