Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks
Shen Yuan, Hongteng Xu

TL;DR
Sliceformer introduces a simple sorting-based alternative to multi-head attention in Transformers, reducing complexity and improving efficiency while maintaining or enhancing performance across various discriminative tasks.
Contribution
The paper proposes a slicing-sorting mechanism as a surrogate for multi-head attention, simplifying the Transformer architecture and reducing its computational cost.
Findings
Achieves comparable or better performance than traditional Transformers.
Demonstrates lower memory usage and faster computation.
Suppresses mode collapse in data representation.
Abstract
As one of the most popular neural network modules, Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing. The effectiveness of the Transformer is often attributed to its multi-head attention (MHA) mechanism. In this study, we discuss the limitations of MHA, including the high computational complexity due to its "query-key-value" architecture and the numerical issue caused by its softmax operation. Considering the above problems and the recent development tendency of the attention layer, we propose an effective and efficient surrogate of the Transformer, called Sliceformer. Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation, i.e., projecting inputs linearly to a latent space and sorting them along different feature dimensions…
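The "slicing-sorting" operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `slice_and_sort`, the plain-list matrix representation, and the choice of ascending order are assumptions made for illustration, and details such as residual connections, normalization, or any alternative sorting orders are omitted.

```python
def slice_and_sort(x, w):
    """Illustrative slicing-sorting step (hypothetical helper, not from the paper).

    x: N x D input, a list of N token vectors with D features each.
    w: D x K projection matrix, a list of D rows with K entries each.
    Returns an N x K matrix whose columns are each sorted independently.
    """
    n, d, k = len(x), len(w), len(w[0])

    # "Slicing": project every token linearly into a K-dimensional latent space.
    v = [[sum(x[i][j] * w[j][c] for j in range(d)) for c in range(k)]
         for i in range(n)]

    # "Sorting": sort each latent feature dimension (column) across the N tokens.
    # Ascending order is an assumption of this sketch.
    sorted_cols = [sorted(v[i][c] for i in range(n)) for c in range(k)]

    # Transpose back to N x K so each row is again a token-sized vector.
    return [[sorted_cols[c][i] for c in range(k)] for i in range(n)]


# Usage: three tokens with two features and an identity projection.
x = [[3.0, 1.0], [1.0, 2.0], [2.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
out = slice_and_sort(x, w)
```

In this sketch the projection costs O(NDK) and the per-column sort costs O(KN log N), and no N x N attention matrix or softmax is ever formed, which is consistent with the efficiency motivation stated in the abstract.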
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Advanced Neural Network Applications
Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Residual Connection · Byte Pair Encoding · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Cosine Annealing
