Auto-X3D: Ultra-Efficient Video Understanding via Finer-Grained Neural Architecture Search
Yifan Jiang, Xinyu Gong, Junru Wu, Humphrey Shi, Zhicheng Yan, Zhangyang Wang

TL;DR
Auto-X3D introduces a fine-grained neural architecture search for ultra-efficient video recognition models, significantly improving accuracy and reducing computational costs compared to prior methods.
Contribution
The paper searches directly in a large, fine-grained 3D architecture space using probabilistic NAS, instead of the coarse, one-axis-at-a-time exploration used by X3D.
Findings
Outperforms existing models in accuracy by up to 1.3% under similar FLOPs.
Reduces computational cost by up to 1.74 times at similar accuracy.
Demonstrates effectiveness on Kinetics and Something-Something-V2 datasets.
Abstract
Efficient video architecture is the key to deploying video recognition systems on devices with limited computing resources. Unfortunately, existing video architectures are often computationally intensive and not suitable for such applications. The recent X3D work presents a new family of efficient video models by expanding a hand-crafted image architecture along multiple axes, such as space, time, width, and depth. Although it operates in a conceptually large space, X3D searches one axis at a time and explores only a small set of 30 architectures in total, which does not sufficiently cover the space. This paper bypasses existing 2D architectures and directly searches for 3D architectures in a fine-grained space, where block type, filter number, expansion ratio, and attention block are jointly searched. A probabilistic neural architecture search method is adopted to efficiently search…
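To make the idea of a jointly searched, fine-grained space concrete, here is a minimal sketch in Python of sampling architectures from per-block categorical distributions. It is a generic illustration of probabilistic architecture sampling, not the paper's method: the candidate values in `SEARCH_SPACE`, the depth `NUM_BLOCKS`, and all function names are hypothetical and chosen only for the example.

```python
import math
import random

# Hypothetical candidate values for each searchable choice (not from the paper).
SEARCH_SPACE = {
    "block_type": ["2d", "3d", "2plus1d"],
    "filters": [24, 32, 48],
    "expansion_ratio": [1.5, 2.25, 3.0],
    "attention": [False, True],
}
NUM_BLOCKS = 4  # assumed number of searchable blocks, for illustration only


def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_logits():
    """One logit vector per (block, choice); zeros mean a uniform distribution."""
    return [
        {name: [0.0] * len(options) for name, options in SEARCH_SPACE.items()}
        for _ in range(NUM_BLOCKS)
    ]


def sample_architecture(logits, rng):
    """Draw all choices of all blocks jointly, one categorical sample each."""
    arch = []
    for block_logits in logits:
        block = {}
        for name, options in SEARCH_SPACE.items():
            probs = softmax(block_logits[name])
            block[name] = rng.choices(options, weights=probs, k=1)[0]
        arch.append(block)
    return arch


def space_size():
    """Number of distinct architectures when every choice varies per block."""
    per_block = math.prod(len(options) for options in SEARCH_SPACE.values())
    return per_block ** NUM_BLOCKS


rng = random.Random(0)
logits = init_logits()
arch = sample_architecture(logits, rng)
```

Even this toy space holds far more candidates than a search that varies one axis at a time, which is why a sampling-based search is a natural fit. A full search would also update the logits toward choices that score well on validation accuracy under a compute budget. That update rule is left out here because the excerpt does not describe it.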
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition · Video Surveillance and Tracking Methods
