TBT-Former: Learning Temporal Boundary Distributions for Action Localization

Thisara Rathnayaka; Uthayasanker Thayasivam

arXiv:2512.01298 · cs.CV · December 2, 2025

TL;DR

TBT-Former is a Transformer-based architecture for temporal action localization. It improves feature extraction, multi-scale fusion, and boundary uncertainty modeling, and reports state-of-the-art results on THUMOS14 and EPIC-Kitchens 100.

Contribution

TBT-Former extends the ActionFormer baseline with a boundary distribution regression head and a multi-scale feature pyramid, improving localization accuracy over existing models.

Findings

01. Achieves new state-of-the-art performance on the THUMOS14 and EPIC-Kitchens 100 datasets.

02. Models boundary uncertainty by predicting a probability distribution for each boundary.

03. Enhances temporal feature extraction with a scaled Transformer backbone.
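
The second finding, modeling a boundary as a probability distribution instead of a single number, can be illustrated with a minimal sketch. It is not the authors' implementation: this page does not describe the head in detail, so the sketch assumes a common formulation in which the model emits logits over a fixed set of discrete offset bins and the boundary offset is decoded as the expectation of the resulting softmax distribution. The function names (`expected_offset`, `decode_segment`, `distribution_entropy`), the bin layout, and the `stride` parameter are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits, bin_width=1.0):
    """Decode an offset as the mean of a discrete distribution.

    Assumption: bin i represents an offset of i * bin_width time steps.
    """
    probs = softmax(logits)
    return bin_width * sum(i * p for i, p in enumerate(probs))


def decode_segment(t, start_logits, end_logits, stride=1.0):
    """Turn per-time-step boundary distributions into (start, end).

    Assumption: anchor-free decoding, where the start lies d_start steps
    before time step t and the end lies d_end steps after it.
    """
    d_start = expected_offset(start_logits)
    d_end = expected_offset(end_logits)
    return (t - d_start) * stride, (t + d_end) * stride


def distribution_entropy(logits):
    """Entropy of the boundary distribution; higher means a fuzzier boundary."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    sharp = [-9.0, -9.0, 9.0, -9.0, -9.0]  # confident: mass on bin 2
    fuzzy = [0.0, 0.0, 0.0, 0.0, 0.0]      # uncertain: uniform over bins
    print(decode_segment(10, sharp, fuzzy))
    print(distribution_entropy(sharp), distribution_entropy(fuzzy))
```

In this sketch both example distributions decode to roughly the same offset, but their entropies differ sharply. That is the practical appeal of the approach: a flat distribution signals an ambiguous boundary even when its mean matches that of a confident prediction.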

Abstract

Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP)…

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis