Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection

Yicheng Qiu; Keiji Yanai

arXiv:2604.09164 · cs.CV · April 13, 2026


TL;DR

This paper introduces a spatial-temporal focal adapter built on State Space Models (SSMs) for action detection in long videos, targeting the feature redundancy and weakened global dependency modeling seen in existing CNN and Transformer models.

Contribution

The paper proposes a framework that integrates the proposed TB-SSM and the Efficient Spatial-Temporal Focal (ESTF) Adapter into pre-trained models to improve temporal and spatial feature modeling.

Findings

01 Significantly improves localization accuracy on multiple benchmarks.

02 Enhances robustness and global temporal reasoning in long video sequences.

03 Outperforms previous SSM-based and structural methods in experiments.

Abstract

Temporal human action detection aims to identify and localize action segments within untrimmed videos, and it is a pivotal task in video understanding. Despite the progress achieved by prior architectures such as CNN and Transformer models, these models continue to struggle with feature redundancy and degraded global dependency modeling when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative, with long-term modeling at linear cost and robust global temporal reasoning. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed…
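To make the abstract's point about linear long-term modeling concrete, the following is a minimal sketch of a generic discrete diagonal SSM scan. It is not the paper's TB-SSM or ESTF Adapter, whose details are not given here; the function name `diagonal_ssm_scan` and the parameters `a`, `b`, `c` are illustrative assumptions. The recurrence is h_t = a * h_{t-1} + b * x_t with readout y_t = sum(c * h_t), so one pass over T frames costs O(T) for a fixed state size.

```python
from typing import List, Sequence


def diagonal_ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a generic discrete diagonal linear SSM over a 1-D sequence.

    Hypothetical illustration only (not the paper's TB-SSM):
        h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
        y_t = sum(c * h_t)
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must share the same state size")

    h = [0.0] * len(a)  # hidden state, initialised to zero
    y: List[float] = []
    for x_t in x:  # single pass over the sequence: cost grows linearly with its length
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


if __name__ == "__main__":
    # An impulse at t=0 decays geometrically through the state (a = 0.5).
    print(diagonal_ssm_scan([1.0, 0.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
    # prints [1.0, 0.5, 0.25, 0.125]
```

The state `h` carries information from every earlier frame forward, which is how a recurrence of this kind retains global temporal context without the pairwise frame comparisons that attention performs.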
