TBT-Former: Learning Temporal Boundary Distributions for Action Localization
Thisara Rathnayaka, Uthayasanker Thayasivam

TL;DR
TBT-Former is a Transformer-based architecture for more accurate temporal action localization. It improves feature extraction, multi-scale fusion, and boundary uncertainty modeling, and reports state-of-the-art results on THUMOS14 and EPIC-Kitchens 100.
Contribution
The paper proposes a boundary distribution regression head and a multi-scale feature pyramid, which together improve localization accuracy over existing models.
Findings
Achieves new state-of-the-art performance on the THUMOS14 and EPIC-Kitchens 100 datasets.
Models boundary uncertainty by predicting a probability distribution over boundary locations.
Enhances temporal feature extraction with a higher-capacity scaled Transformer backbone (more attention heads and an expanded MLP).
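
To illustrate what modeling a boundary as a probability distribution can look like, here is a minimal sketch. It is not the paper's implementation: the discretization into bins, the softmax over per-bin logits, and the names softmax, expected_offset, and bin_width are all assumptions chosen for illustration. The idea shown is that a head outputs one logit per candidate offset bin, the predicted offset is the expectation of the resulting distribution, and its spread gives a rough measure of how "fuzzy" the boundary is.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_width=1.0):
    """Turn per-bin logits into (mean offset, standard deviation).

    Hypothetical helper: bin i is assumed to represent an offset of
    i * bin_width time steps from the current location.
    """
    probs = softmax(logits)
    mean = sum(p * i * bin_width for i, p in enumerate(probs))
    var = sum(p * (i * bin_width - mean) ** 2 for i, p in enumerate(probs))
    return mean, math.sqrt(var)


# A sharply peaked distribution: the boundary is well defined.
sharp_mean, sharp_std = expected_offset([-5.0, -5.0, 8.0, -5.0, -5.0])

# A flatter distribution centred on the same bin: an ambiguous boundary.
fuzzy_mean, fuzzy_std = expected_offset([0.0, 1.0, 1.2, 1.0, 0.0])

print(f"sharp: offset={sharp_mean:.3f}, std={sharp_std:.3f}")
print(f"fuzzy: offset={fuzzy_mean:.3f}, std={fuzzy_std:.3f}")
```

Both examples place the expected boundary at the same offset, but the second has a much larger standard deviation. A single-value regression head would return the same number for both and lose that distinction.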
Abstract
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP)…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
