2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition
Hengduo Li, Zuxuan Wu, Abhinav Shrivastava, Larry S. Davis

TL;DR
Ada3D adaptively selects frames and convolution layers for each video, reducing computation by 20-50% while maintaining high accuracy in video recognition tasks.
Contribution
Introduces Ada3D, a conditional computation framework that learns instance-specific 3D usage policies for efficient video recognition.
Findings
Achieves accuracy similar to state-of-the-art models while using 20-50% less computation.
Policies are transferable across different backbones and clip selection methods.
Allocates more resources to motion-intensive videos, fewer to static ones.
Abstract
3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and thus are computationally demanding. Exploiting large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network. These policies are derived with a two-head lightweight selection network conditioned on each input video clip. Then, only frames and convolutions that are selected by the selection network are used in the 3D model to generate predictions. The selection network is optimized with policy gradient methods to maximize a reward that encourages making correct predictions with limited computation. We conduct experiments on three video…
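The selection-and-reward loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two "heads" are reduced to plain lists of logits (one per frame, one per 3D convolution layer), and the reward shape (a linear savings term when the prediction is correct, a fixed penalty otherwise) is an assumption, since the abstract only says the reward encourages correct predictions with limited computation. All function names here are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_policy(logits, rng):
    """Sample one keep/skip decision per logit from independent Bernoullis."""
    probs = [sigmoid(z) for z in logits]
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, probs


def reward(correct, frame_actions, layer_actions, penalty=1.0):
    """Assumed reward: computation saved if correct, a fixed penalty if wrong."""
    used = sum(frame_actions) + sum(layer_actions)
    total = len(frame_actions) + len(layer_actions)
    usage = used / total  # fraction of frames and 3D layers kept
    return (1.0 - usage) if correct else -penalty


def reinforce_grads(actions, probs, r):
    """REINFORCE estimate: d/dz [r * log pi(a)] = r * (a - p) for a Bernoulli
    policy with p = sigmoid(z)."""
    return [r * (a - p) for a, p in zip(actions, probs)]


if __name__ == "__main__":
    rng = random.Random(0)
    frame_logits = [0.0] * 8   # head 1: one logit per input frame (hypothetical size)
    layer_logits = [0.0] * 4   # head 2: one logit per 3D conv layer (hypothetical size)

    frame_actions, frame_probs = sample_policy(frame_logits, rng)
    layer_actions, layer_probs = sample_policy(layer_logits, rng)

    # In a real system `correct` would come from running the 3D backbone with
    # only the selected frames and layers; here it is simply assumed to be True.
    r = reward(True, frame_actions, layer_actions)
    frame_grads = reinforce_grads(frame_actions, frame_probs, r)
    layer_grads = reinforce_grads(layer_actions, layer_probs, r)
    print(frame_actions, layer_actions, r)
```

Under this assumed reward, a correct prediction that keeps fewer frames and layers earns more, so gradient ascent on the logits pushes the policy toward cheaper selections as long as accuracy holds.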
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
Methods: Convolution
