AdaFocus V2: End-to-End Training of Spatial Dynamic Networks for Video Recognition

Yulin Wang; Yang Yue; Yuanze Lin; Haojun Jiang; Zihang Lai; Victor Kulikov; Nikita Orlov; Humphrey Shi; Gao Huang

arXiv:2112.14238·cs.CV·April 13, 2022

TL;DR

AdaFocus V2 introduces an end-to-end trainable, efficient spatial dynamic network for video recognition, improving accuracy and training simplicity over the original AdaFocus by reformulating its training process.

Contribution

It presents a differentiable, one-stage training method for AdaFocus, along with an improved training scheme and a conditional-exit technique for adaptive computation.
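As a rough illustration of what a conditional-exit rule can look like, the sketch below stops processing frames once the running prediction is confident enough. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `predict_with_early_exit`, the averaging of per-frame logits, and the softmax-confidence threshold are all hypothetical choices made for this example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_early_exit(per_frame_logits, threshold):
    """Return (predicted_class, frames_used).

    Assumption: frames are processed in order and their logits are averaged;
    inference stops as soon as the top softmax probability reaches `threshold`.
    """
    if not per_frame_logits:
        raise ValueError("need at least one frame")
    running = [0.0] * len(per_frame_logits[0])
    for t, logits in enumerate(per_frame_logits, start=1):
        # Accumulate logits, then score the average over the frames seen so far.
        running = [r + l for r, l in zip(running, logits)]
        probs = softmax([r / t for r in running])
        confidence = max(probs)
        if confidence >= threshold:
            break  # confident enough: skip the remaining frames
    return probs.index(confidence), t
```

With a lower threshold, easy videos exit after fewer frames and cost less computation, while a threshold above 1 disables early exit and always uses every frame.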

Findings

01. Outperforms original AdaFocus and baselines on six datasets.

02. Significantly more efficient and easier to train.

03. Achieves better accuracy with reduced training complexity.

Abstract

Recent works have shown that the computational efficiency of video recognition can be significantly improved by reducing spatial redundancy. As a representative work, the adaptive focus method (AdaFocus) has achieved a favorable trade-off between accuracy and inference speed by dynamically identifying and attending to the informative regions in each video frame. However, AdaFocus requires a complicated three-stage training pipeline (involving reinforcement learning), which leads to slow convergence and is unfriendly to practitioners. This work reformulates the training of AdaFocus as a simple one-stage algorithm by introducing a differentiable interpolation-based patch selection operation, enabling efficient end-to-end optimization. We further present an improved training scheme to address the issues introduced by the one-stage formulation, including the lack of supervision, input…
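To make the idea of interpolation-based patch selection concrete, here is a minimal sketch, assuming a single-channel frame stored as a list of rows and a patch described by a real-valued center. Bilinear interpolation is used as one plausible choice of interpolation; the helper names `bilinear_sample`, `crop_patch`, and `patch_mean` are hypothetical and not taken from the paper or its code. The point it illustrates is that every patch pixel becomes a smooth function of the continuous center coordinates, so a loss on the patch can pass gradients back to whatever predicts the center.

```python
import math


def bilinear_sample(frame, y, x):
    """Bilinearly interpolate a 2-D frame (at least 2x2) at real-valued (y, x)."""
    h, w = len(frame), len(frame[0])
    # Clamp the query point so its four neighbours stay inside the frame.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0 = min(int(math.floor(y)), h - 2)
    x0 = min(int(math.floor(x)), w - 2)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * frame[y0][x0] + dx * frame[y0][x0 + 1]
    bottom = (1 - dx) * frame[y0 + 1][x0] + dx * frame[y0 + 1][x0 + 1]
    return (1 - dy) * top + dy * bottom


def crop_patch(frame, center_y, center_x, size):
    """Crop a size x size patch around a continuous center via interpolation."""
    half = (size - 1) / 2.0
    return [
        [bilinear_sample(frame, center_y - half + i, center_x - half + j)
         for j in range(size)]
        for i in range(size)
    ]


def patch_mean(frame, center_y, center_x, size):
    """Scalar summary of the patch, standing in for a downstream loss."""
    patch = crop_patch(frame, center_y, center_x, size)
    return sum(sum(row) for row in patch) / (size * size)


# Toy frame whose intensity is 10 * row + column.
frame = [[10.0 * i + j for j in range(8)] for i in range(8)]
eps = 1e-3
# Finite-difference estimate of d(patch_mean) / d(center_y).
grad_y = (patch_mean(frame, 3.5 + eps, 3.5, 2) - patch_mean(frame, 3.5, 3.5, 2)) / eps
```

On this toy frame `grad_y` comes out close to 10, the vertical intensity slope. A hard integer crop would give no such signal for sub-pixel moves of the center, which is why non-differentiable patch selection typically needs something like reinforcement learning to train.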
