MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

David Junhao Zhang; Kunchang Li; Yali Wang; Yunpeng Chen; Shashwat Chandra; Yu Qiao; Luoqi Liu; Mike Zheng Shou

arXiv:2111.12527·cs.CV·August 24, 2022·5 cites

TL;DR

MorphMLP introduces an efficient, attention-free MLP-like backbone for video and image representation learning, balancing accuracy and computation by leveraging specialized spatial and temporal fully-connected layers.

Contribution

The paper proposes MorphMLP, a novel MLP-like architecture with dedicated spatial and temporal modules, achieving state-of-the-art results with reduced computational cost.

Findings

01 MorphMLP outperforms SOTA models on Kinetics400 and SSV2 benchmarks.

02 MorphMLP significantly reduces GFLOPs compared to recent models.

03 The architecture is effective for both video and image domain tasks.

Abstract

Recently, MLP-Like networks have been revived for image recognition. However, whether it is possible to build a generic MLP-Like architecture for the video domain has not been explored, because spatial-temporal modeling is complex and carries a large computational burden. To fill this gap, we present an efficient self-attention-free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s can effectively capture core semantics in each frame, by progressive token interaction along both height and width dimensions. In turn, MorphFC_t can adaptively learn long-term dependency over frames, by temporal token aggregation on each spatial location. With such multi-dimension and…
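
The block structure described in the abstract (spatial mixing along height and width, followed by temporal mixing at each spatial location) can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the names `mix_along_axis` and `morph_block_sketch`, the dictionary-based tensor layout, and the choice to apply the height and width layers one after another are all assumptions made for illustration. Normalization, residual connections, channel mixing, and any other details of the real model are left out, because the abstract does not describe them.

```python
from itertools import product


def mix_along_axis(x, shape, weights, axis):
    """Apply one fully-connected layer (no bias) along a single axis.

    x       : dict mapping an index tuple (t, h, w, c) to a float (hypothetical layout)
    shape   : (T, H, W, C)
    weights : square matrix, weights[out_pos][in_pos], of size shape[axis]
    axis    : 0 = time, 1 = height, 2 = width
    """
    size = shape[axis]
    out = {}
    for idx in product(*(range(s) for s in shape)):
        total = 0.0
        for j in range(size):
            # Source token: same position as idx except along the mixed axis.
            src = idx[:axis] + (j,) + idx[axis + 1:]
            total += weights[idx[axis]][j] * x[src]
        out[idx] = total
    return out


def morph_block_sketch(x, shape, w_height, w_width, w_time):
    """Spatial mixing first, then temporal mixing (the order given in the abstract)."""
    # Spatial part: tokens interact along height, then along width.
    # Applying the two directions sequentially is an assumption of this sketch.
    x = mix_along_axis(x, shape, w_height, axis=1)
    x = mix_along_axis(x, shape, w_width, axis=2)
    # Temporal part: each spatial location aggregates its tokens over frames.
    x = mix_along_axis(x, shape, w_time, axis=0)
    return x


# Tiny usage example: 2 frames, 2x2 pixels, 1 channel.
shape = (2, 2, 2, 1)
video = {idx: float(sum(idx)) for idx in product(*(range(s) for s in shape))}
identity = [[1.0, 0.0], [0.0, 1.0]]
average = [[0.5, 0.5], [0.5, 0.5]]
mixed = morph_block_sketch(video, shape, identity, identity, average)
print(mixed[(0, 1, 1, 0)], mixed[(1, 1, 1, 0)])  # prints: 2.5 2.5
```

With identity weights for the spatial layers and averaging weights for the temporal layer, every spatial location ends up holding the mean of its values over the two frames, which is the simplest case of temporal token aggregation at a fixed location.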

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections · Softmax · Residual Connection · Adam