Searching for Two-Stream Models in Multivariate Space for Video Recognition

Xinyu Gong; Heng Wang; Zheng Shou; Matt Feiszli; Zhangyang Wang; Zhicheng Yan

arXiv:2108.12957·cs.CV·August 31, 2021

Open Access

TL;DR

This paper introduces an efficient neural architecture search method to automatically discover high-performing two-stream video recognition models, significantly reducing manual design effort and computational costs.

Contribution

The paper proposes a multivariate search space and a progressive search procedure for automatic design of two-stream video models, outperforming manually designed architectures.

Findings

01 Auto-TSNet models outperform existing models on standard benchmarks.

02 Auto-TSNet-L reduces FLOPS by a factor of 11 while maintaining accuracy.

03 Auto-TSNet-M improves accuracy on Something-Something-V2 while using less than 50 GFLOPS.

Abstract

Conventional video models rely on a single stream to capture complex spatial-temporal features. Recent work on two-stream video models, such as the SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the in-between fusion blocks is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach that can search for two-stream video models in giant spaces efficiently. We design a multivariate search space, including 6 search variables, to capture a wide variety of choices in designing two-stream models.…
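To make the idea of a multivariate search space with a progressive search procedure concrete, here is a minimal Python sketch. It is not the authors' method: the six variables (`var_1` to `var_6`), their choice lists, and `toy_score` are hypothetical placeholders, and "progressive" is read here simply as settling one variable at a time while keeping earlier decisions fixed. The sketch only illustrates why such a stage-wise search needs far fewer candidate evaluations than a joint search over all combinations.

```python
from math import prod

# Hypothetical search space: six variables with illustrative choices.
# Names and values are assumptions, not taken from the paper.
SPACE = {
    "var_1": [1, 2, 4],
    "var_2": [8, 16, 32],
    "var_3": [0, 1],
    "var_4": [3, 5, 7],
    "var_5": [0, 1, 2, 3],
    "var_6": [0, 1],
}

# Hypothetical optimum used only by the toy scoring function below.
TARGET = {"var_1": 2, "var_2": 16, "var_3": 1, "var_4": 5, "var_5": 3, "var_6": 0}


def toy_score(cfg):
    """Stand-in for training and validating a candidate model (higher is better)."""
    return -sum(abs(cfg[name] - TARGET[name]) for name in cfg)


def progressive_search(space, score):
    """Settle one variable at a time, keeping earlier decisions fixed."""
    cfg = {name: choices[0] for name, choices in space.items()}
    evaluations = 0
    for name, choices in space.items():
        best_choice, best_score = None, float("-inf")
        for choice in choices:
            trial = {**cfg, name: choice}
            trial_score = score(trial)
            evaluations += 1
            if trial_score > best_score:
                best_choice, best_score = choice, trial_score
        cfg[name] = best_choice
    return cfg, evaluations


best_cfg, n_evals = progressive_search(SPACE, toy_score)
joint_size = prod(len(choices) for choices in SPACE.values())
print(f"progressive evaluations: {n_evals}, joint search size: {joint_size}")
```

In this toy setting the stage-wise search evaluates the sum of the choice counts, while a joint search would evaluate their product. The toy score is separable across variables, so the greedy procedure recovers the optimum here; with interacting variables that is not guaranteed.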

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Analysis and Summarization