TL;DR
This paper introduces SOAP, a plug-and-play architecture for few-shot action recognition that uses frame tuples to capture spatio-temporal relations and motion information, and reports state-of-the-art results on SthSthV2, Kinetics, UCF101, and HMDB51.
Contribution
Proposes SOAP-Net, built on the plug-and-play SOAP (Spatio-tempOral frAme tuPle enhancer) architecture, which strengthens spatio-temporal relation modeling and motion information capture in few-shot action recognition.
Findings
Achieves new state-of-the-art performance on SthSthV2, Kinetics, UCF101, and HMDB51.
Demonstrates robustness and generalization across benchmarks.
Shows the effectiveness of frame tuples with diverse frame counts.
Abstract
High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we…
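The contrast the abstract draws between motion captured from adjacent frame pairs and motion captured from frame tuples of varying frame counts can be illustrated with a toy sketch. This is not the SOAP implementation: `frame_tuples` and `tuple_motion` are hypothetical helpers, the sliding-window grouping and the absolute-difference motion proxy are assumptions chosen for simplicity, and the feature values are made up.

```python
from typing import List, Sequence, Tuple


def frame_tuples(num_frames: int, tuple_size: int) -> List[Tuple[int, ...]]:
    """Index tuples of `tuple_size` consecutive frames (sliding window, stride 1).

    Assumption: tuples are contiguous windows; the paper may group frames differently.
    """
    if not 2 <= tuple_size <= num_frames:
        raise ValueError("tuple_size must be between 2 and num_frames")
    return [tuple(range(start, start + tuple_size))
            for start in range(num_frames - tuple_size + 1)]


def tuple_motion(features: Sequence[Sequence[float]],
                 idx: Tuple[int, ...]) -> List[float]:
    """Crude motion proxy (an assumption, not the paper's operator):
    per-channel sum of absolute differences between consecutive frames
    inside one tuple."""
    dim = len(features[idx[0]])
    motion = [0.0] * dim
    for prev, nxt in zip(idx, idx[1:]):
        for c in range(dim):
            motion[c] += abs(features[nxt][c] - features[prev][c])
    return motion


# Toy per-frame features: 4 frames, 2 channels each (made-up values).
feats = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [4.0, 2.0]]

# Size 2 corresponds to adjacent-frame pairs; size 3 widens the temporal view.
for size in (2, 3):
    for idx in frame_tuples(len(feats), size):
        print(size, idx, tuple_motion(feats, idx))
```

With a tuple size of 2 each motion vector reflects only one adjacent pair, while larger tuples aggregate change over a wider temporal span, which is the intuition behind combining several tuple sizes.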
