TL;DR
STAR is a unified semantic-temporal framework for few-shot action recognition that aligns visual and textual cues and models multi-scale temporal dynamics to improve recognition accuracy.
Contribution
The paper proposes a semantic-alignment module and a temporal-aware module, integrating large language models and attention mechanisms to enhance few-shot action recognition.
Findings
Achieves up to an 8.1% accuracy improvement on the SSv2-Full dataset.
Demonstrates consistent superiority over state-of-the-art methods across five benchmarks.
Validates effectiveness with significant gains under limited supervision.
Abstract
Few-shot action recognition (FSAR) requires models to generalize to novel action categories from only a handful of annotated samples. Despite progress with vision-language models, existing approaches still suffer from two problems. The first is semantic-temporal misalignment, where static textual prompts fail to capture decisive visual cues that appear sparsely across sequences. The second is inadequate modeling of multi-scale temporal dynamics, as short-term discriminative cues and long-range dependencies are often either oversmoothed or fragmented. To address these challenges, we propose Semantic Temporal Adaptive Representation Learning (STAR), a unified framework consisting of a semantic-alignment component and a temporal-aware component, which bridges the semantic and temporal gaps and transfers the sequence modeling capability of Mamba to FSAR. The semantic alignment module introduces a Temporal…
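To make the few-shot setting in the abstract concrete, here is a minimal sketch of a single N-way K-shot episode. It is not STAR's method: it uses plain mean pooling over frames and nearest-prototype matching by cosine similarity, a generic baseline chosen only for illustration. All names (`mean_pool`, `classify_episode`, `toy_video`) and the synthetic three-dimensional "features" are assumptions made for this example.

```python
import math
import random

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def classify_episode(support, query):
    """Label one query video using class prototypes built from the support set.

    support: dict mapping label -> list of K videos.
    Each video is a list of per-frame feature vectors.
    """
    prototypes = {}
    for label, videos in support.items():
        pooled = [mean_pool(video) for video in videos]  # frames -> video vector
        prototypes[label] = mean_pool(pooled)            # K shots -> prototype
    q = mean_pool(query)
    return max(prototypes, key=lambda label: cosine(prototypes[label], q))

def toy_video(center, rng, num_frames=8, noise=0.1):
    """Hypothetical stand-in for backbone features: noisy copies of a class center."""
    return [[c + rng.gauss(0, noise) for c in center] for _ in range(num_frames)]

# A 3-way 2-shot episode on synthetic features.
rng = random.Random(0)
centers = {
    "push": [1.0, 0.0, 0.0],
    "pull": [0.0, 1.0, 0.0],
    "drop": [0.0, 0.0, 1.0],
}
support = {label: [toy_video(c, rng) for _ in range(2)] for label, c in centers.items()}
query = toy_video(centers["pull"], rng)
prediction = classify_episode(support, query)
print(prediction)
```

Mean pooling discards frame order entirely, which is the kind of oversmoothing of temporal cues the abstract says existing approaches suffer from.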
