2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video Recognition

Hengduo Li; Zuxuan Wu; Abhinav Shrivastava; Larry S. Davis

arXiv:2012.14950 · cs.CV · April 30, 2021 · 5 cites

Open Access

TL;DR

Ada3D adaptively selects frames and convolution layers for each video, reducing computation by 20-50% while maintaining high accuracy in video recognition tasks.

Contribution

Introduces Ada3D, a conditional computation framework that learns instance-specific 3D usage policies for efficient video recognition.

Findings

01 Achieves accuracy similar to state-of-the-art models while requiring 20-50% less computation.

02 Policies are transferable across different backbones and clip selection methods.

03 Allocates more computation to motion-intensive videos and less to static ones.

Abstract

3D convolutional networks are prevalent for video recognition. While achieving excellent recognition performance on standard benchmarks, they operate on a sequence of frames with 3D convolutions and thus are computationally demanding. Exploiting large variations among different videos, we introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network. These policies are derived with a two-head lightweight selection network conditioned on each input video clip. Then, only frames and convolutions that are selected by the selection network are used in the 3D model to generate predictions. The selection network is optimized with policy gradient methods to maximize a reward that encourages making correct predictions with limited computation. We conduct experiments on three video…
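The abstract's training signal (a policy-gradient update that rewards correct predictions made with limited computation) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the reward shape, the `penalty` parameter, the `toy_correct` stand-in for the 3D model, and the use of free logits for a single toy clip in place of a selection network conditioned on each input clip.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_policy(logits, rng):
    """Sample one keep/skip bit per unit from independent Bernoulli distributions."""
    probs = [sigmoid(z) for z in logits]
    actions = [1 if rng.random() < p else 0 for p in probs]
    return actions, probs


def reward(correct, frame_actions, layer_actions, penalty=1.0):
    """Hypothetical reward: computation saved if correct, a fixed penalty otherwise."""
    used = sum(frame_actions) + sum(layer_actions)
    total = len(frame_actions) + len(layer_actions)
    return (1.0 - used / total) if correct else -penalty


def reinforce_step(logits, actions, probs, advantage, lr=0.1):
    """REINFORCE update; the gradient of log Bernoulli(a; sigmoid(z)) w.r.t. z is a - p."""
    return [z + lr * advantage * (a - p) for z, a, p in zip(logits, actions, probs)]


def toy_correct(frame_actions, layer_actions):
    """Hypothetical stand-in for the 3D model: correct only if at least
    two frames are kept and the first layer keeps its 3D convolution."""
    return sum(frame_actions) >= 2 and layer_actions[0] == 1


def train(num_frames=4, num_layers=3, steps=2000, seed=0):
    """Train two sets of logits (frame head, layer head) on the toy task."""
    rng = random.Random(seed)
    frame_logits = [0.0] * num_frames
    layer_logits = [0.0] * num_layers
    baseline = 0.0  # moving-average baseline to reduce gradient variance
    for _ in range(steps):
        f_act, f_prob = sample_policy(frame_logits, rng)
        l_act, l_prob = sample_policy(layer_logits, rng)
        r = reward(toy_correct(f_act, l_act), f_act, l_act)
        adv = r - baseline
        frame_logits = reinforce_step(frame_logits, f_act, f_prob, adv)
        layer_logits = reinforce_step(layer_logits, l_act, l_prob, adv)
        baseline = 0.9 * baseline + 0.1 * r
    return frame_logits, layer_logits
```

Calling `train()` returns the learned logits for the frame head and the layer head; passing them through `sigmoid` gives the keep probabilities. The two separate logit lists loosely mirror the two-head design described above, and the moving-average baseline is a common variance-reduction choice rather than something the abstract specifies.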

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications

Methods: Convolution