Efficient Spatial-Temporal Focal Adapter with SSM for Temporal Action Detection
Yicheng Qiu, Keiji Yanai

TL;DR
This paper introduces a spatial-temporal focal adapter with a state space model (SSM) to improve action detection in long videos, targeting the feature redundancy and weakened global dependency modeling of existing models.
Contribution
The paper proposes a framework that integrates TB-SSM and the Efficient Spatial-Temporal Focal (ESTF) Adapter into pre-trained models for better temporal and spatial feature modeling.
Findings
Significantly improves localization accuracy on multiple benchmarks.
Enhances robustness and global temporal reasoning in long video sequences.
Outperforms previous SSM-based and structural methods in experiments.
Abstract
Temporal human action detection aims to identify and localize action segments within untrimmed videos, and is a pivotal task in video understanding. Despite the progress achieved by prior architectures such as CNNs and Transformers, these models continue to struggle with feature redundancy and degraded global dependency modeling when applied to long video sequences. These limitations severely constrain their scalability in real-world video analysis. State Space Models (SSMs) offer a promising alternative, with linear-complexity long-term modeling and robust global temporal reasoning. Rethinking the application of SSMs in temporal modeling, this research constructs a novel framework for video human action detection. Specifically, we introduce the Efficient Spatial-Temporal Focal (ESTF) Adapter into the pre-trained layers. This module integrates the advantages of our proposed…
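To make the abstract's claim about linear long-term modeling concrete, here is a minimal sketch of a generic discrete linear SSM scan. It is not the paper's TB-SSM or ESTF Adapter, whose details are not given here. The function name `ssm_scan`, the diagonal state matrix, and the parameters `a`, `b`, `c` are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def ssm_scan(
    x: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> List[float]:
    """Run a generic discrete linear SSM over a 1-D input sequence.

    Assumes a diagonal state matrix (illustrative simplification):
        h_t = a * h_{t-1} + b * x_t   (elementwise, one entry per state dim)
        y_t = sum(c * h_t)
    """
    if not (len(a) == len(b) == len(c)):
        raise ValueError("a, b and c must have the same length")

    h = [0.0] * len(a)  # hidden state, initialised to zero
    ys: List[float] = []
    for x_t in x:  # single pass: cost grows linearly with len(x)
        h = [a_i * h_i + b_i * x_t for a_i, h_i, b_i in zip(a, h, b)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


if __name__ == "__main__":
    # An impulse at t=0 decays through the state with factor 0.5 per step.
    print(ssm_scan([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[1.0]))
```

Each step updates a fixed-size state, so the work per frame is constant regardless of how long the video is. Self-attention, by contrast, has a cost that grows quadratically with sequence length.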
