Otter: Mitigating Background Distractions of Wide-Angle Few-Shot Action Recognition with Enhanced RWKV
Wenbo Huang, Jinghui Zhang, Zhenghao Chen, Guang Li, Lei Zhang, Yang Cao, Fang Dong, Takahiro Ogawa, Miki Haseyama

TL;DR
Otter enhances RWKV with segmentation and temporal reconstruction modules that emphasize subjects and reconstruct temporal relations in wide-angle few-shot action recognition, achieving state-of-the-art results on SSv2, Kinetics, UCF101, and HMDB51.
Contribution
The paper proposes Otter, a method that combines Compound Segmentation and Temporal Reconstruction modules to enhance subject focus and temporal modeling in wide-angle few-shot action recognition (FSAR).
Findings
Achieves state-of-the-art performance on SSv2, Kinetics, UCF101, and HMDB51.
Outperforms existing methods on the VideoBadminton dataset.
Effectively highlights subjects and reconstructs temporal relations in wide-angle videos.
Abstract
Wide-angle videos in few-shot action recognition (FSAR) effectively express actions within specific scenarios. However, without a global understanding of both subjects and background, recognizing actions in such samples remains challenging because of background distractions. Receptance Weighted Key Value (RWKV), which learns interactions between various dimensions, shows promise for global modeling. However, directly applying RWKV to wide-angle FSAR may fail to highlight subjects because of excessive background information. Additionally, temporal relations degraded by frames with similar backgrounds are difficult to reconstruct, further impacting performance. Therefore, we design the CompOund SegmenTation and Temporal REconstructing RWKV (Otter). Specifically, the Compound Segmentation Module (CSM) is devised to segment and emphasize key patches in each frame, effectively highlighting…
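To make the idea of segmenting a frame into patches and emphasizing the key ones more concrete, here is a minimal NumPy sketch. It is not the paper's CSM: the function names, the patch size, the use of per-patch variance as a saliency score, and the softmax reweighting are all assumptions chosen only for illustration.

```python
import numpy as np


def split_into_patches(frame, patch):
    """Split an (H, W, C) frame into non-overlapping (patch, patch, C) tiles.

    Returns an array of shape (H // patch, W // patch, patch, patch, C).
    """
    h, w, c = frame.shape
    if h % patch or w % patch:
        raise ValueError("frame height and width must be divisible by patch")
    gh, gw = h // patch, w // patch
    return frame.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)


def emphasize_patches(frame, patch=4, temperature=1.0):
    """Reweight patches so that high-scoring ones dominate the frame.

    ASSUMPTION: per-patch variance stands in for a learned saliency score.
    The highest-scoring patch keeps its original intensity; all other
    patches are scaled down in proportion to their softmax weight.
    """
    tiles = split_into_patches(frame.astype(np.float64), patch)
    gh, gw = tiles.shape[:2]

    # Hypothetical saliency proxy: variance of all values inside a patch.
    scores = tiles.reshape(gh, gw, -1).var(axis=-1)

    # Numerically stable softmax over all patches in the frame.
    logits = scores.ravel() / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = (weights / weights.sum()).reshape(gh, gw)

    # Scale each patch relative to the strongest one, then reassemble.
    scale = (weights / weights.max())[:, :, None, None, None]
    out = (tiles * scale).transpose(0, 2, 1, 3, 4).reshape(frame.shape)
    return out, weights
```

Calling `emphasize_patches(frame)` on a single `(H, W, C)` frame returns the reweighted frame and a grid of patch weights that sums to 1. In a real model the variance score would be replaced by a learned component.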
Taxonomy
Topics: Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning · Hand Gesture Recognition Systems
