Multi-task Learning with Extended Temporal Shift Module for Temporal Action Localization
Anh-Kiet Duong, Petra Gomez-Krämer

TL;DR
This paper extends the Temporal Shift Module with multi-task learning for temporal action localization in multi-perspective, multi-modal video, and the resulting system ranked first in both rounds of the BinEgo-360 Challenge at ICCV 2025.
Contribution
The paper extends the Temporal Shift Module for TAL by adding background classification and multi-task learning, combined with ensemble strategies, to improve localization accuracy in multi-modal videos.
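The multi-task idea can be illustrated with a joint objective that sums an action-classification loss and a scene-classification loss. This is only a minimal sketch: the summary does not state the loss form or the weighting, so the additive combination, the `scene_weight` parameter, and the function names below are assumptions made for illustration.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample; `target` is a class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def multitask_loss(action_logits, action_label,
                   scene_logits, scene_label, scene_weight=0.5):
    """Hypothetical joint loss: action term plus a weighted scene term.

    `scene_weight` is an assumed hyperparameter, not a value from the paper.
    The action classes are assumed to include one extra background class.
    """
    action_term = cross_entropy(action_logits, action_label)
    scene_term = cross_entropy(scene_logits, scene_label)
    return action_term + scene_weight * scene_term

# One interval with 3 action classes (index 0 assumed to be background)
# and 2 scene classes.
loss = multitask_loss([0.2, 1.5, -0.3], 1, [0.8, 0.1], 0)
```

Sharing one backbone between the two heads is what lets scene context inform the action prediction; the weight only controls how strongly the auxiliary task shapes the shared features.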
Findings
Ranked first in both the initial and extended rounds of the BinEgo-360 Challenge at ICCV 2025
Effective multi-task learning framework for TAL
Improved robustness through ensemble modeling
Abstract
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the…
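To make the interval-based formulation concrete, the sketch below combines per-interval class probabilities from several models with a weighted average, then merges consecutive intervals that share a non-background label into action segments. It is a hedged illustration only: the background index `BACKGROUND = 0`, the interval length, the model weights, and the merging rule are all assumptions, and the authors' actual decoding and ensembling may differ.

```python
BACKGROUND = 0  # assumed index of the background class

def weighted_ensemble(model_probs, weights):
    """Weighted average of per-interval class probabilities across models.

    model_probs[m][t][c] is model m's probability of class c for interval t.
    """
    total = sum(weights)
    n_intervals = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(w * probs[t][c] for w, probs in zip(weights, model_probs)) / total
         for c in range(n_classes)]
        for t in range(n_intervals)
    ]

def decode_segments(interval_probs, interval_seconds):
    """Turn per-interval predictions into (start, end, class) segments.

    Intervals are fixed-length and non-overlapping; runs of the same
    non-background label are merged, and background runs are dropped.
    """
    labels = [max(range(len(p)), key=p.__getitem__) for p in interval_probs]
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            if labels[start] != BACKGROUND:
                segments.append((start * interval_seconds,
                                 t * interval_seconds,
                                 labels[start]))
            start = t
    return segments

# Two hypothetical models, four 2-second intervals, three classes.
model_a = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.6, 0.1], [0.8, 0.1, 0.1]]
model_b = [[0.8, 0.1, 0.1], [0.4, 0.5, 0.1], [0.6, 0.3, 0.1], [0.7, 0.2, 0.1]]
fused = weighted_ensemble([model_a, model_b], weights=[2, 1])
segments = decode_segments(fused, interval_seconds=2.0)
# segments == [(2.0, 6.0, 1)]
```

In this toy example the second model alone would label the third interval as background, but the higher-weighted first model keeps it in the action, which is the kind of consistency a weighted ensemble is meant to provide.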
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Robot Manipulation and Learning
