Skimming and Scanning for Untrimmed Video Action Recognition
Yunyan Hong, Ailing Zeng, Min Li, Cewu Lu, Li Jiang, Qiang Xu

TL;DR
This paper introduces the Skim-Scan framework for untrimmed video action recognition, employing a divide-and-conquer clip selection strategy inspired by speed reading to improve accuracy and efficiency.
Contribution
It proposes a novel clip-level skimming and scanning approach that adaptively selects informative clips, outperforming existing methods in accuracy and computational efficiency.
Findings
Surpasses state-of-the-art accuracy on ActivityNet and mini-FCVID datasets.
Reduces computational complexity while maintaining high performance.
Effectively filters uninformative clips to focus on essential content.
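To make the contrast between a fixed clip budget and adaptive filtering concrete, here is a minimal sketch. It is not the authors' implementation: the per-clip informativeness scores, the threshold value, and the function names select_top_n and skim are all hypothetical, and the paper's actual scoring model and its scanning stage are not reproduced here.

```python
from typing import List, Sequence


def select_top_n(scores: Sequence[float], n: int) -> List[int]:
    """Fixed-budget baseline: indices of the n highest-scoring clips, in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


def skim(scores: Sequence[float], threshold: float) -> List[int]:
    """Adaptive filter (assumed form): keep every clip whose score reaches the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical informativeness scores for six clips of one untrimmed video.
scores = [0.05, 0.10, 0.80, 0.75, 0.08, 0.90]

print(select_top_n(scores, 4))  # a fixed N=4 must also keep a low-scoring clip
print(skim(scores, 0.5))        # the threshold keeps only the clips that pass it
```

With a fixed N, the number of kept clips is the same for every video regardless of how many are informative; with a threshold, the count varies per video, which is the sense of "adaptive" assumed in this sketch.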
Abstract
Video action recognition (VAR) is a primary task of video understanding, and untrimmed videos are more common in real-life scenes. Untrimmed videos have redundant and diverse clips containing contextual information, so sampling dense clips is essential. Recently, some works have attempted to train a generic model to select the N most representative clips. However, it is difficult to model the complex relations among intra-class clips and inter-class videos within a single model and a fixed number of selected clips, and the entanglement of multiple relations is also hard to explain. Thus, instead of "only look once", we argue that a "divide and conquer" strategy is more suitable for untrimmed VAR. Inspired by the speed reading mechanism, we propose a simple yet effective clip-level solution based on skim-scan techniques. Specifically, the proposed Skim-Scan framework first skims the entire video and drops…
Taxonomy
Topics: Human Pose and Action Recognition · Diabetic Foot Ulcer Assessment and Management · Video Analysis and Summarization
