MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

TL;DR
MorphMLP introduces an efficient, attention-free MLP-like backbone for video and image representation learning, balancing accuracy and computation by leveraging specialized spatial and temporal fully-connected layers.
Contribution
The paper proposes MorphMLP, a novel MLP-like architecture with dedicated spatial and temporal modules, achieving state-of-the-art results with reduced computational cost.
Findings
MorphMLP outperforms state-of-the-art models on the Kinetics-400 and Something-Something V2 (SSV2) benchmarks.
MorphMLP significantly reduces GFLOPs compared to recent models.
The architecture is effective for both video and image domain tasks.
Abstract
Recently, MLP-like networks have been revived for image recognition. However, whether a generic MLP-like architecture can be built for the video domain has not been explored, because spatial-temporal modeling is complex and carries a large computational burden. To fill this gap, we present an efficient self-attention-free backbone, namely MorphMLP, which flexibly leverages the concise Fully-Connected (FC) layer for video representation learning. Specifically, a MorphMLP block consists of two key layers in sequence, i.e., MorphFC_s and MorphFC_t, for spatial and temporal modeling respectively. MorphFC_s effectively captures core semantics in each frame through progressive token interaction along both the height and width dimensions. In turn, MorphFC_t adaptively learns long-term dependencies over frames through temporal token aggregation at each spatial location. With such multi-dimension and…
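The two layers described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the (T, H, W, C) tensor layout, the single shared weight matrices, the summation of the width and height branches, and the residual connections are all assumptions made for illustration. Details such as channel grouping, normalization, and bias terms are omitted.

```python
import numpy as np


def fc_along_width(x, weight, chunk_len):
    """Mix tokens inside non-overlapping chunks along the width axis.

    x:      (T, H, W, C) video features (layout is an assumption)
    weight: (chunk_len * C, chunk_len * C) hypothetical FC weight
    """
    T, H, W, C = x.shape
    if W % chunk_len != 0:
        raise ValueError("width must be divisible by chunk_len")
    # Flatten each run of `chunk_len` neighbouring tokens into one vector.
    chunks = x.reshape(T, H, W // chunk_len, chunk_len * C)
    # One fully-connected layer applied to every chunk.
    mixed = chunks @ weight.T
    return mixed.reshape(T, H, W, C)


def morph_fc_s(x, w_width, w_height, chunk_len):
    """Spatial layer sketch: token interaction along width and height."""
    horizontal = fc_along_width(x, w_width, chunk_len)
    # Swap H and W so the same routine mixes tokens along height.
    swapped = x.transpose(0, 2, 1, 3)
    vertical = fc_along_width(swapped, w_height, chunk_len).transpose(0, 2, 1, 3)
    # Summing the two branches is an assumption of this sketch.
    return horizontal + vertical


def morph_fc_t(x, w_time):
    """Temporal layer sketch: aggregate tokens over frames per location.

    w_time: (T, T) hypothetical frame-mixing weight, shared across
            spatial locations and channels.
    """
    # out[t, h, w, c] = sum_s w_time[t, s] * x[s, h, w, c]
    return np.einsum("ts,shwc->thwc", w_time, x)


def morph_block(x, w_width, w_height, w_time, chunk_len):
    """Spatial layer followed by temporal layer, each with an assumed residual."""
    x = x + morph_fc_s(x, w_width, w_height, chunk_len)
    x = x + morph_fc_t(x, w_time)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, C, chunk_len = 4, 6, 6, 2, 3
    video = rng.standard_normal((T, H, W, C))
    w_w = rng.standard_normal((chunk_len * C, chunk_len * C)) * 0.1
    w_h = rng.standard_normal((chunk_len * C, chunk_len * C)) * 0.1
    w_t = rng.standard_normal((T, T)) * 0.1
    out = morph_block(video, w_w, w_h, w_t, chunk_len)
    print(out.shape)  # (4, 6, 6, 2)
```

In this sketch the spatial layer only mixes tokens within a chunk, so a larger `chunk_len` widens the spatial context each FC layer sees, while the temporal layer lets every frame contribute to every other frame at the same spatial position.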
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Dense Connections · Softmax · Residual Connection · Adam
