TL;DR
FrameExit introduces a conditional early exiting framework that adaptively processes fewer frames for simpler videos, significantly reducing computational costs while maintaining high accuracy in video recognition tasks.
Contribution
It presents a novel cascade of gating modules with on-the-fly supervision to dynamically balance accuracy and efficiency in video recognition.
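The summary does not say how the on-the-fly supervision for the gates is computed. As a rough sketch of one plausible reading, the snippet below derives a binary target for each gate from the classification loss at that point in processing: a gate is told to "exit" when the loss is already within a budget. The names `gate_targets` and `loss_budget`, and the thresholding rule itself, are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy of one prediction against the true class index."""
    return -math.log(max(probs[label], 1e-12))


def gate_targets(step_probs: Sequence[Sequence[float]],
                 label: int,
                 loss_budget: float = 0.5) -> List[int]:
    """Hypothetical pseudo-labels for a cascade of gates.

    step_probs[t] holds the class probabilities predicted after seeing
    t + 1 frames. A gate gets target 1 (exit) when the loss at that step
    is within loss_budget, and 0 (keep going) otherwise.
    """
    return [1 if cross_entropy(p, label) <= loss_budget else 0
            for p in step_probs]


# Predictions that improve as more frames are seen; the true class is 1.
probs_over_time = [[0.4, 0.6], [0.3, 0.7], [0.1, 0.9]]
strict = gate_targets(probs_over_time, label=1, loss_budget=0.5)
relaxed = gate_targets(probs_over_time, label=1, loss_budget=0.6)
```

Under this assumption, a larger `loss_budget` marks earlier steps as acceptable exits, which trades some accuracy for less computation; a smaller budget does the opposite.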
Findings
Outperforms existing efficient video recognition methods on ActivityNet1.3 and mini-kinetics while using 1.3× and 2.1× fewer GFLOPs, respectively
Sets a new state of the art for efficient video recognition on the HVU benchmark
Automatically adapts processing based on video complexity
Abstract
In this paper, we propose a conditional early exiting framework for efficient video recognition. While existing works focus on selecting a subset of salient frames to reduce the computation costs, we propose to use a simple sampling strategy combined with conditional early exiting to enable efficient recognition. Our model automatically learns to process fewer frames for simpler videos and more frames for complex ones. To achieve this, we employ a cascade of gating modules to automatically determine the earliest point in processing where an inference is sufficiently reliable. We generate on-the-fly supervision signals to the gates to provide a dynamic trade-off between accuracy and computational cost. Our proposed model outperforms competing methods on three large-scale video benchmarks. In particular, on ActivityNet1.3 and mini-kinetics, we outperform the state-of-the-art efficient…
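To make the early-exit idea concrete, here is a minimal sketch of the inference loop: frames are processed one at a time, and processing stops at the first point where the prediction looks reliable enough. It makes two simplifying assumptions that are not from the abstract: per-frame logits are aggregated by a running average, and the learned gating modules are replaced by a plain softmax-confidence threshold. The function name `early_exit_predict` and the `threshold` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(frame_logits: Sequence[Sequence[float]],
                       threshold: float = 0.9) -> Tuple[int, int]:
    """Return (predicted_class, frames_used).

    Stand-in for a cascade of gates: after each frame, average the logits
    seen so far and exit as soon as the top class probability reaches
    `threshold`. If no step is confident enough, all frames are used.
    """
    if not frame_logits:
        raise ValueError("need at least one frame")
    running = [0.0] * len(frame_logits[0])
    for t, logits in enumerate(frame_logits, start=1):
        running = [r + x for r, x in zip(running, logits)]
        probs = softmax([r / t for r in running])
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining frames
    return probs.index(confidence), t


# A "simple" video is recognizable from its first frame ...
easy = early_exit_predict([[5.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
# ... while an ambiguous one needs more frames before exiting.
hard = early_exit_predict([[0.5, 0.4, 0.0], [0.2, 1.0, 0.0],
                           [0.0, 6.0, 0.0], [0.0, 8.0, 0.0]])
```

In this toy example the easy video exits after one frame and the hard one after four, which mirrors the behavior the abstract describes: fewer frames for simpler videos, more for complex ones.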
Taxonomy
Methods: Early exiting using confidence measures
