Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization

Anh-Kiet Duong; Petra Gomez-Krämer

arXiv:2512.11189 · cs.CV · December 15, 2025

TL;DR

This paper extends the Temporal Shift Module (TSM) to temporal action localization (TAL) in multi-perspective, multi-modal video, combining it with multi-task learning and a weighted model ensemble. The method ranked first in both rounds of the BinEgo-360 Challenge at ICCV 2025.

Contribution

The paper adapts the Temporal Shift Module to TAL by adding a background class and classifying fixed-length, non-overlapping intervals. It trains the model jointly on scene classification and TAL, then combines multiple models through a weighted ensemble to make predictions more robust and consistent.
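
The joint objective can be illustrated with a minimal sketch. The summary does not state the loss used, so the weighted sum of two cross-entropy terms below is an assumption, as are the names `cross_entropy`, `multitask_loss`, and `scene_weight`, and the choice of NumPy.

```python
import numpy as np


def cross_entropy(logits, targets):
    """Mean cross-entropy for integer class targets."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())


def multitask_loss(action_logits, action_targets,
                   scene_logits, scene_targets, scene_weight=0.5):
    """Assumed joint loss: action term plus a weighted scene term.

    The action head is assumed to include a background class alongside
    the annotated action classes. scene_weight is a hypothetical
    hyperparameter, not a value taken from the paper.
    """
    action_term = cross_entropy(action_logits, action_targets)
    scene_term = cross_entropy(scene_logits, scene_targets)
    return action_term + scene_weight * scene_term
```

Sharing one backbone between the two heads is what would let scene context inform the action predictions; the weight only controls how strongly the scene term shapes the shared features.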

Findings

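01 Ranked first in both the initial and extended rounds of the BinEgo-360 Challenge at ICCV 2025

02 Multi-task learning framework that jointly optimizes scene classification and TAL

03 Improved robustness and consistency of predictions through a weighted ensemble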

Abstract

We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the…
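
The interval-based formulation and the weighted ensemble can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the background class is assumed to sit at index 0, adjacent intervals with the same predicted label are assumed to be merged into one segment, and the function names, the ensemble weights, and the interval length are all hypothetical.

```python
import numpy as np

BACKGROUND = 0  # assumed index of the background class


def weighted_ensemble(prob_list, weights):
    """Weighted average of per-interval class probabilities from several models."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize so the weights sum to 1
    stacked = np.stack(prob_list)             # shape: (models, intervals, classes)
    return np.tensordot(w, stacked, axes=1)   # shape: (intervals, classes)


def intervals_to_segments(probs, interval_sec):
    """Turn per-interval predictions into (start, end, label, score) segments.

    Runs of consecutive intervals with the same non-background label are
    merged; the score is the mean probability of that label over the run.
    """
    labels = probs.argmax(axis=1)
    segments = []
    start = 0
    for i in range(1, len(labels) + 1):
        # Close the current run at the end of the video or on a label change.
        if i == len(labels) or labels[i] != labels[start]:
            label = int(labels[start])
            if label != BACKGROUND:
                score = float(probs[start:i, label].mean())
                segments.append((start * interval_sec, i * interval_sec, label, score))
            start = i
    return segments


# Usage with two hypothetical models, five 2-second intervals, three classes.
model_a = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.1, 0.9, 0.0],
                    [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]])
model_b = np.array([[0.7, 0.2, 0.1], [0.4, 0.6, 0.0], [0.3, 0.7, 0.0],
                    [0.6, 0.2, 0.2], [0.3, 0.1, 0.6]])
fused = weighted_ensemble([model_a, model_b], weights=[0.5, 0.5])
segments = intervals_to_segments(fused, interval_sec=2.0)
```

Because the intervals are fixed-length and non-overlapping, localization reduces to classification plus a merge step, so segment boundaries can only fall on multiples of the interval length.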

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning