MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing
Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

TL;DR
MLP-3D is an MLP-like 3D architecture for video recognition built on grouped time mixing. It reaches accuracy competitive with 3D CNNs and video transformers while requiring fewer computations.
Contribution
The paper proposes a new MLP-based 3D architecture with grouped time mixing operations, enabling effective temporal modeling without convolutions or attention.
Findings
Achieves 68.5% top-1 accuracy on Something-Something V2
Achieves 81.4% top-1 accuracy on Kinetics-400
Comparable performance to state-of-the-art 3D CNNs and transformers
Abstract
Convolutional Neural Networks (CNNs) have been regarded as the go-to models for visual recognition. More recently, convolution-free networks, based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs), have become increasingly popular. Nevertheless, it is not trivial to apply these newly minted networks to video recognition due to the large variations and complexities in video data. In this paper, we present MLP-3D networks, a novel MLP-like 3D architecture for video recognition. Specifically, the architecture consists of MLP-3D blocks, where each block contains one MLP applied across tokens (i.e., token-mixing MLP) and one MLP applied independently to each token (i.e., channel MLP). By deriving the novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling. GTM divides the input tokens into several temporal…
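The block structure described in the abstract (a token-mixing MLP with temporal mixing, followed by a per-token channel MLP) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract excerpt above is truncated, so the grouping scheme here is an assumption: frames are split into contiguous temporal groups and one shared linear map mixes tokens along time inside each group. All function names, parameter names, and shapes (grouped_time_mixing, w_time, w1, w2, the (T, N, C) layout) are hypothetical, and spatial token mixing is omitted for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel axis (no learnable scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def grouped_time_mixing(x, w_time, num_groups):
    # x: (T, N, C) = frames, spatial tokens per frame, channels.
    # ASSUMPTION: groups are contiguous runs of T // num_groups frames.
    T, N, C = x.shape
    if T % num_groups != 0:
        raise ValueError("T must be divisible by num_groups")
    g = T // num_groups
    xg = x.reshape(num_groups, g, N, C)
    # One shared (g, g) matrix mixes tokens along time within each group;
    # no information flows between different groups.
    out = np.einsum("st,ktnc->ksnc", w_time, xg)
    return out.reshape(T, N, C)


def channel_mlp(x, w1, w2):
    # Two-layer MLP applied independently to every token.
    return gelu(x @ w1) @ w2


def mlp3d_block(x, params, num_groups):
    # Token mixing (temporal only in this sketch), then channel MLP,
    # each wrapped in a residual connection.
    x = x + grouped_time_mixing(layer_norm(x), params["w_time"], num_groups)
    x = x + channel_mlp(layer_norm(x), params["w1"], params["w2"])
    return x


# Usage with hypothetical sizes: 8 frames, 4 tokens per frame, 6 channels.
rng = np.random.default_rng(0)
T, N, C, hidden, groups = 8, 4, 6, 12, 2
params = {
    "w_time": rng.standard_normal((T // groups, T // groups)) * 0.1,
    "w1": rng.standard_normal((C, hidden)) * 0.1,
    "w2": rng.standard_normal((hidden, C)) * 0.1,
}
y = mlp3d_block(rng.standard_normal((T, N, C)), params, groups)
print(y.shape)  # (8, 4, 6)
```

With this grouping, the time-mixing weight has size (T / groups) squared rather than T squared, which is one plausible reason a grouped design could reduce parameters and computation. Whether the paper's GTM realizes its savings this way is not stated in the excerpt above.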
Taxonomy
Topics: Neural Networks and Applications · Image and Signal Denoising Methods · Face and Expression Recognition
