Understanding MLP-Mixer as a Wide and Sparse MLP

Tomohiro Hayase; Ryo Karakida

arXiv:2306.01470 · cs.LG · May 8, 2024 · 1 citation



TL;DR

This paper argues that the success of MLP-Mixer architectures stems from an effectively wide and sparse structure: a Mixer can be written as a wider MLP with Kronecker-product weights, and in the linear case the architecture induces an implicit sparse regularization.

Contribution

It provides a novel theoretical understanding of MLP-Mixer as a wide, sparse MLP, connecting architecture to sparsity properties and empirical performance improvements.

Findings

01

MLP-Mixers can be expressed as wider MLPs with Kronecker-product weights.

02

The architecture induces implicit sparse regularization.

03

Empirical evidence shows similarity to unstructured sparse-weight MLPs.
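The first finding can be checked numerically on a toy example. The sketch below is illustrative only and is not taken from the paper: it assumes a feature matrix `X` of shape tokens × channels (`S` × `C`), a token-mixing weight `W` applied from the left, a channel-mixing weight `V` applied from the right, column-major vectorization, and no nonlinearity or normalization. The helper names are hypothetical. Under those assumptions, each mixing step equals one wide weight matrix with Kronecker-product structure acting on the flattened input.

```python
import numpy as np


def vec(M):
    """Column-major vectorization: stack the columns of M into one vector."""
    return M.flatten(order="F")


def token_mixing_as_wide_weight(W, C):
    """Wide weight for X -> W @ X, using vec(W X) = (I_C kron W) vec(X)."""
    return np.kron(np.eye(C), W)


def channel_mixing_as_wide_weight(V, S):
    """Wide weight for X -> X @ V, using vec(X V) = (V^T kron I_S) vec(X)."""
    return np.kron(V.T, np.eye(S))


# Toy sizes (assumed for illustration): S tokens, C channels.
rng = np.random.default_rng(0)
S, C = 4, 3
X = rng.standard_normal((S, C))
W = rng.standard_normal((S, S))  # token-mixing weight
V = rng.standard_normal((C, C))  # channel-mixing weight

wide_token = token_mixing_as_wide_weight(W, C)      # shape (S*C, S*C)
wide_channel = channel_mixing_as_wide_weight(V, S)  # shape (S*C, S*C)

# The wide matrices reproduce the mixing operations on the flattened input.
token_ok = np.allclose(wide_token @ vec(X), vec(W @ X))
channel_ok = np.allclose(wide_channel @ vec(X), vec(X @ V))

# Fraction of nonzero entries in each wide weight matrix.
token_density = np.count_nonzero(wide_token) / wide_token.size
channel_density = np.count_nonzero(wide_channel) / wide_channel.size

print("token mixing matches:", token_ok)
print("channel mixing matches:", channel_ok)
print("token-mixing density:", token_density)
print("channel-mixing density:", channel_density)
```

With dense `W` and `V`, the wide matrices have `S*C` rows and columns but only a fraction `1/C` (token mixing) or `1/S` (channel mixing) of nonzero entries, which is one simple way to read "wide and sparse" in this toy setting.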

Abstract

The multi-layer perceptron (MLP) is a fundamental component of deep learning, and recent MLP-based architectures, especially the MLP-Mixer, have achieved significant empirical success. Nevertheless, why and how the MLP-Mixer outperforms conventional MLPs remains largely unexplored. In this work, we reveal that sparseness is a key mechanism underlying the MLP-Mixers. First, the Mixers have an effective expression as a wider MLP with Kronecker-product weights, clarifying that the Mixers efficiently embody several sparseness properties explored in deep learning. In the case of linear layers, the effective expression elucidates an implicit sparse regularization caused by the model architecture and a hidden relation to Monarch matrices, which are also known as another form of sparse parameterization. Next, for general cases, we empirically demonstrate quantitative…

Peer Reviews

Decision · ICML 2024 Poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The derivation and observation seem to be solid.

2. I believe this is the first time the equivalence between the MLP-Mixer and a wide MLP has been formalized.

Weaknesses

1. The paper is not very easy to follow. For one, much of the notation is not defined, which requires the reader to look it up in the original MLP-Mixer paper. Examples are eq. (1) and eq. (2). The plots are also hard to interpret and would benefit from more explanation. The general structure of the paper could also be improved to tell a more coherent story. For example, section 3.2 could be merged with section 5.

2. The contribution is weak. For example, while it's good to formalize the relationship be

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. The authors have conducted extensive parameter analysis to validate that the mixing layer of both the MLP-Mixer and the RP-Mixer effectively represents a wider MLP.

2. The authors offer a new analytical perspective to elucidate the effectiveness of the MLP-Mixer.

Weaknesses

1. The paper falls short in terms of the selection of networks for comparison, thereby resulting in a lack of theoretical support.

2. There is a lack of experimental evidence to support the memory efficiency and lightweight structure of the RP-Mixer.

3. According to (Magnus and Neudecker, 2019), there appear to be slight mistakes in the theoretical proof section. For instance, formula $J_c^{\top}\left(I_S \otimes V\right) J_c=V^{\top} \otimes I_S$ should actually be $J_c^{\top}\left(I_S \otimes

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

The idea of understanding MLP-Mixer as a wide and sparse MLP is original. This paper presents an analysis of the MLP-Mixer method, provides a good explanation connecting MLP-Mixer with the Kronecker product, and shows that the model behaves as a wide MLP with sparse weights. It is a novel explanation to attribute the success of the Mixer architecture to the effective width of a sparse MLP. Experimental results show that the wide and sparse MLPs can achieve results comparable to the MLP-Mixer ar

Weaknesses

1. A performance comparison of the MLP-Mixer with the Wide MLP and RP-Mixer is missing. It would be nice to add inference speed and memory consumption comparisons among the three methods (MLP-Mixer, Wide MLP, RP-Mixer).

2. The absolute results are a bit low on both CIFAR (84.1% baseline for Mixer) and ImageNet-1k (76.4% baseline for Mixer). This makes the improvements less convincing, as a 0.3 percent boost on ImageNet could easily be caused by many different reasons (augmentation, hyper-par


Taxonomy

Topics: Neural Networks and Applications · Machine Learning and ELM · Brain Tumor Detection and Classification

Methods: Average Pooling · Layer Normalization · Global Average Pooling · Dense Connections · Dropout · Residual Connection · MLP-Mixer