MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

Zhaofan Qiu; Ting Yao; Chong-Wah Ngo; Tao Mei

arXiv:2206.06292 · cs.CV · June 14, 2022


TL;DR

MLP-3D is an MLP-like 3D architecture for video recognition that uses grouped time mixing for temporal modeling, achieving competitive accuracy with less computation than CNNs and transformers.

Contribution

The paper proposes a new MLP-based 3D architecture with grouped time mixing operations, enabling effective temporal modeling without convolutions or attention.

Findings

01 Achieves 68.5% top-1 accuracy on Something-Something V2

02 Achieves 81.4% top-1 accuracy on Kinetics-400

03 Comparable performance to state-of-the-art 3D CNNs and transformers

Abstract

Convolutional Neural Networks (CNNs) have been regarded as the go-to models for visual recognition. More recently, convolution-free networks, based on multi-head self-attention (MSA) or multi-layer perceptrons (MLPs), have become increasingly popular. Nevertheless, it is not trivial to apply these newly-minted networks to video recognition due to the large variations and complexities in video data. In this paper, we present MLP-3D networks, a novel MLP-like 3D architecture for video recognition. Specifically, the architecture consists of MLP-3D blocks, where each block contains one MLP applied across tokens (i.e., token-mixing MLP) and one MLP applied independently to each token (i.e., channel MLP). By deriving the novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling. GTM divides the input tokens into several temporal…
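To make the block structure described in the abstract more concrete, here is a minimal NumPy sketch of one block with a token-mixing step along time followed by a channel MLP. The abstract is truncated where it describes GTM, so the details below are assumptions for illustration, not the authors' implementation: tokens are assumed to be split into contiguous temporal groups of equal size, one projection matrix `w_time` is assumed to be shared across all groups, and the pre-norm residual layout, the GELU activation, and every function and parameter name (`grouped_time_mixing`, `channel_mlp`, `mlp3d_block`, `group_size`) are hypothetical.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (activation choice is an assumption).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-5):
    # Normalize over the channel axis, without learned scale or shift.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def grouped_time_mixing(x, w_time, group_size):
    """Mix tokens along time within each temporal group.

    x: (T, N, C) array with T frames, N spatial tokens per frame, C channels.
    w_time: (group_size, group_size) matrix shared by all groups (assumption).
    """
    T, N, C = x.shape
    assert T % group_size == 0, "T must be divisible by group_size"
    # Split consecutive frames into groups: (num_groups, group_size, N, C).
    grouped = x.reshape(T // group_size, group_size, N, C)
    # out[g, s, n, c] = sum_t w_time[s, t] * grouped[g, t, n, c]
    mixed = np.einsum("st,gtnc->gsnc", w_time, grouped)
    return mixed.reshape(T, N, C)

def channel_mlp(x, w1, w2):
    # Two-layer MLP applied independently to every token.
    return gelu(x @ w1) @ w2

def mlp3d_block(x, w_time, group_size, w1, w2):
    # Hypothetical block: token mixing, then channel MLP, each with a residual.
    x = x + grouped_time_mixing(layer_norm(x), w_time, group_size)
    x = x + channel_mlp(layer_norm(x), w1, w2)
    return x

# Usage with random weights: 8 frames, 4 tokens per frame, 16 channels.
rng = np.random.default_rng(0)
T, N, C, group_size = 8, 4, 16, 4
x = rng.standard_normal((T, N, C))
w_time = rng.standard_normal((group_size, group_size)) * 0.1
w1 = rng.standard_normal((C, 4 * C)) * 0.1
w2 = rng.standard_normal((4 * C, C)) * 0.1
y = mlp3d_block(x, w_time, group_size, w1, w2)
print(y.shape)  # (8, 4, 16)
```

In this sketch a frame only exchanges information with the other frames in its own group, so the time-mixing matrix has `group_size * group_size` entries regardless of the clip length `T`.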


Taxonomy

Topics: Neural Networks and Applications · Image and Signal Denoising Methods · Face and Expression Recognition