Towards Better Multi-head Attention via Channel-wise Sample Permutation
Shen Yuan, Hongteng Xu

TL;DR
This paper introduces a channel-wise sample permutation operator that simplifies multi-head attention in Transformers, reducing parameters and complexity while maintaining or improving performance across vision and language tasks.
Contribution
The study proposes a novel CSP operator that implicitly implements cross-channel attention with fewer parameters and lower complexity, enhancing Transformer efficiency.
Findings
CSP achieves comparable or better performance than traditional MHA.
CSP reduces model parameters and computational costs.
CSP demonstrates effectiveness in vision and language tasks.
Abstract
Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing, whose effectiveness is mainly attributed to its multi-head attention (MHA) mechanism. In this study, we propose a simple and novel channel-wise sample permutation (CSP) operator, achieving a new structured MHA with fewer parameters and lower complexity. Given an input matrix, CSP circularly shifts the samples of different channels with various steps and then sorts grouped samples of each channel. This operator is equivalent to implicitly implementing cross-channel attention maps as permutation matrices, which achieves linear complexity and suppresses the risk of rank collapse when representing data. We replace the MHA of some representative models with CSP and test the CSP-based models in several discriminative tasks, including…
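The two steps named in the abstract (a per-channel circular shift, then sorting within groups of each channel) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function names `csp_sketch` and `shift_matrix`, the shift schedule (channel `ch` shifted by `ch` positions), and the `group_size` parameter are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np


def csp_sketch(x, group_size):
    """Illustrative channel-wise sample permutation on an (N, C) matrix.

    Hypothetical choices (not taken from the paper): channel `ch` is
    shifted by `ch` positions, and groups are consecutive blocks of
    `group_size` samples.
    """
    n, c = x.shape
    assert n % group_size == 0, "N must be divisible by group_size"
    out = np.empty_like(x)
    for ch in range(c):
        # Step 1: circularly shift the samples of this channel.
        shifted = np.roll(x[:, ch], shift=ch % n)
        # Step 2: sort the samples inside each consecutive group.
        groups = shifted.reshape(n // group_size, group_size)
        out[:, ch] = np.sort(groups, axis=1).reshape(n)
    return out


def shift_matrix(n, step):
    """Permutation matrix P with P @ v == np.roll(v, step)."""
    return np.roll(np.eye(n), shift=step, axis=0)


x = np.arange(12.0).reshape(4, 3)
y = csp_sketch(x, group_size=2)
```

Both steps only reorder the entries of a channel, so each output column is a permutation of the corresponding input column. A circular shift is a multiplication by a permutation matrix, which is always full rank; that is one way to read the abstract's remark about rank collapse.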
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. The proposed operator achieves comparable performance to current Transformer variants with fewer parameters and lower complexity. 2. The experimental comparisons effectively demonstrate performance compared to current state-of-the-art Transformers.
1. The experiments mainly cover discriminative tasks. 2. According to Tab. 4, the improvement over previous approaches is limited. 3. The paper focuses on the shift operator. I think it is necessary to discuss previous shift operators, including channel-shift and spatial-shift approaches, e.g., TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices.
1. The writing is clear and the methodology is accurately described. 2. The CSP module obtains promising results in several tests.
1. The motivation for the model is clear, and it can be understood that CSP needs to construct inter-group and intra-group permutations to achieve feature mixing. However, the paper's explanation of the model is confusing in places, and its link to the illustrations (e.g., Fig. 1) is weak, which makes the method harder to understand. For example, the specific definition of the substitution matrix $T_{c}^{k}$ and $K$ in Eq. (5); also, unlike the definition in Eq. (3), $J_{c}$ seems to
- The presentation of this paper is good. The authors clearly explain the motivation and give specific implementation details of the proposed method. - The authors provide theoretical analysis to show the advantages of the proposed approach. - Experimental results show that the proposed approach achieves better results than previous attention mechanisms on CIFAR, ImageNet classification, and long-sequence analysis.
- The proposed approach appears to perform better than the other ViT variants shown in Table 3. However, the authors did not compare it with ViTs that use other types of attention mechanisms, like the ones shown in Table 2. - According to the appendix, the training recipe used for image classification is not new. The image classification results seem much lower than those of vision transformers such as DeiT. Many popular training strategies are not used. Have the authors use
Similar to ShuffleNet, this paper presents a new channel mixing method that shuffles the channels of a given set of tokens and then performs token-wise projection. This mechanism can remove the parameters needed for explicit token mixing, as in Attention, MLP-Mixer, and SSM. The authors show that it not only reduces the number of parameters but also has a mathematical advantage over Attention: it is robust to rank collapse. This paper demonstrates that this efficient approach perform
- My main concern is whether this method is generally applicable. For example, as the authors mention in Sec. 3.2.2, this module cannot tractably handle an input with a large number of tokens (e.g., long sequences) because of the limited channel dimension. The authors suggest a solution that treats multiple layers as a whole for channel shifting, but having two separate algorithms for different settings (N = C and N >> C) involves
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Sparse and Compressive Sensing Techniques · Speech and Audio Processing
Methods: Cosine Annealing · Dense Connections · WordPiece · Residual Connection · Linear Warmup With Linear Decay · Dropout · Layer Normalization · Linear Warmup With Cosine Annealing · Adam
