Searching for Two-Stream Models in Multivariate Space for Video Recognition
Xinyu Gong, Heng Wang, Zheng Shou, Matt Feiszli, Zhangyang Wang, and Zhicheng Yan

TL;DR
This paper introduces an efficient neural architecture search method to automatically discover high-performing two-stream video recognition models, significantly reducing manual design effort and computational costs.
Contribution
The paper proposes a multivariate search space and a progressive search procedure for automatic design of two-stream video models, outperforming manually designed architectures.
Findings
Auto-TSNet models outperform existing models on benchmarks.
Auto-TSNet-L reduces FLOPS by 11 times while maintaining accuracy.
Auto-TSNet-M improves accuracy on Something-Something-V2 with less than 50 GFLOPS.
Abstract
Conventional video models rely on a single stream to capture the complex spatial-temporal features. Recent work on two-stream video models, such as SlowFast network and AssembleNet, prescribes separate streams to learn complementary features and achieves stronger performance. However, manually designing both streams as well as the in-between fusion blocks is a daunting task that requires exploring a tremendously large design space. Such manual exploration is time-consuming and often ends up with sub-optimal architectures when computational resources are limited and the exploration is insufficient. In this work, we present a pragmatic neural architecture search approach, which is able to search for two-stream video models in giant spaces efficiently. We design a multivariate search space, including 6 search variables to capture a wide variety of choices in designing two-stream models.…
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Video Analysis and Summarization
