TL;DR
PosMLP-Video introduces an efficient MLP-based backbone for video recognition that uses relative positional encoding to model spatio-temporal relations, achieving competitive accuracy with fewer parameters and FLOPs.
Contribution
The paper proposes PosMLP-Video, a lightweight MLP-like model with novel relative positional encoding and gating units for efficient spatio-temporal video modeling.
Findings
Achieves 59.0% top-1 accuracy on Something-Something V1
Achieves 70.3% top-1 accuracy on Something-Something V2
Achieves 82.1% top-1 accuracy on Kinetics-400
Abstract
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks. However, their inherently dense computational operators, such as self-attention and token-mixing layers, pose significant challenges when applied to spatio-temporal video data. To address this gap, we propose PosMLP-Video, a lightweight yet powerful MLP-like backbone for video recognition. Instead of dense operators, we use efficient relative positional encoding (RPE) to build pairwise token relations, leveraging small-sized parameterized relative position biases to obtain each relation score. Specifically, to enable spatio-temporal modeling, we extend the image PosMLP's positional gating unit to temporal, spatial, and spatio-temporal variants, namely PoTGU, PoSGU, and PoSTGU, respectively. These gating units can be feasibly combined into three types of spatio-temporal…
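The abstract's central idea, scoring token pairs from a small table of relative position biases rather than a dense operator, can be sketched for the temporal case as follows. This is a minimal illustration and not the authors' implementation: the function names `relative_bias_matrix` and `temporal_gating` are hypothetical, the channel split and element-wise gate are an assumption about how a positional gating unit is typically built, and normalization and projection layers are omitted.

```python
def relative_bias_matrix(table, num_frames):
    """Expand a table of 2T-1 biases into a T x T relation matrix.

    Entry (i, j) depends only on the relative offset i - j, so the
    table holds 2T-1 values instead of the T*T of a dense mixing matrix.
    """
    assert len(table) == 2 * num_frames - 1
    offset = num_frames - 1  # shifts i - j from [-(T-1), T-1] to [0, 2T-2]
    return [[table[i - j + offset] for j in range(num_frames)]
            for i in range(num_frames)]


def temporal_gating(tokens, table):
    """Hypothetical temporal gating unit for one spatial location.

    tokens: list of T frames, each a list of C channel values (C even).
    The second half of the channels is mixed across time with the
    relation matrix, then multiplied element-wise with the first half.
    """
    num_frames = len(tokens)
    half = len(tokens[0]) // 2
    relation = relative_bias_matrix(table, num_frames)
    output = []
    for i in range(num_frames):
        gate_input = tokens[i][:half]
        # Weighted sum over all frames j, using the relation score (i, j).
        mixed = [sum(relation[i][j] * tokens[j][half + c]
                     for j in range(num_frames))
                 for c in range(half)]
        output.append([g * m for g, m in zip(gate_input, mixed)])
    return output
```

In this sketch, a table that is nonzero only at offset 0 makes each frame gate itself, while a uniform table mixes every frame equally. In a trained model the table entries would be learned parameters.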
