CS-Mixer: A Cross-Scale Vision MLP Model with Spatial-Channel Mixing
Jonathan Cui, David A. Araujo, Suman Saha, Md. Faisal Kabir

TL;DR
CS-Mixer is a hierarchical Vision MLP that effectively models cross-scale spatial and channel interactions, achieving high accuracy on ImageNet with efficient computation.
Contribution
It introduces a novel cross-scale spatial-channel mixing mechanism in a hierarchical Vision MLP architecture, enhancing performance without high computational costs.
Findings
Achieves 83.2% top-1 accuracy on ImageNet-1k
Uses only 13.7 GFLOPs and 94M parameters
Outperforms previous Vision MLP models
Abstract
Despite their simpler information fusion designs compared with Vision Transformers and Convolutional Neural Networks, Vision MLP architectures have demonstrated strong performance and high data efficiency in recent research. However, existing works such as CycleMLP and Vision Permutator typically model spatial information in equal-size spatial regions and do not consider cross-scale spatial interactions. Further, their token mixers only model 1- or 2-axis correlations, avoiding 3-axis spatial-channel mixing due to its computational demands. We therefore propose CS-Mixer, a hierarchical Vision MLP that learns dynamic low-rank transformations for spatial-channel mixing through cross-scale local and global aggregation. The proposed methodology achieves competitive results on popular image recognition benchmarks without incurring substantially more compute. Our largest model, CS-Mixer-L,…
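The abstract notes that existing token mixers avoid 3-axis spatial-channel mixing because of its computational demands, and that CS-Mixer relies on low-rank transformations instead. The following is a minimal sketch of the underlying arithmetic only, not the authors' implementation: it compares the weight count of one dense linear map over a flattened height × width × channel block with that of a static rank-r factorization of the same map. The function names, the block sizes, and the rank are hypothetical and chosen for illustration; the paper's transformations are described as dynamic, which this sketch does not model.

```python
def full_mixing_params(h: int, w: int, c: int) -> int:
    """Weights in one dense linear map over a flattened h*w*c block."""
    n = h * w * c
    return n * n


def low_rank_mixing_params(h: int, w: int, c: int, rank: int) -> int:
    """Weights in a rank-`rank` factorization: (n x rank) followed by (rank x n)."""
    n = h * w * c
    return 2 * n * rank


# Hypothetical sizes for illustration only; these are not taken from the paper.
h, w, c, rank = 7, 7, 64, 16

full = full_mixing_params(h, w, c)
low = low_rank_mixing_params(h, w, c, rank)

print(f"dense 3-axis mixing weights: {full}")
print(f"rank-{rank} factorized weights: {low}")
print(f"reduction factor: {full / low:.1f}x")
```

With n = h·w·c flattened inputs, the dense map needs n² weights while the factorized one needs 2nr, so the saving grows as n / (2r). This is one way to see why mixing all three axes at once is usually considered affordable only in a low-rank form.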
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Processing Techniques and Applications · CCD and CMOS Imaging Sensors
