Sliceformer: Make Multi-head Attention as Simple as Sorting in Discriminative Tasks

Shen Yuan; Hongteng Xu

arXiv:2310.17683 · cs.LG · October 30, 2023 · 1 citation

TL;DR

Sliceformer introduces a simple sorting-based alternative to multi-head attention in Transformers, reducing complexity and improving efficiency while maintaining or enhancing performance across various discriminative tasks.

Contribution

The paper proposes a novel slicing-sorting mechanism as a surrogate for multi-head attention, simplifying the Transformer architecture and reducing its computational cost.

Findings

01 Achieves comparable or better performance than traditional Transformers.

02 Demonstrates lower memory usage and faster computation.

03 Suppresses mode collapse in data representation.

Abstract

As one of the most popular neural network modules, the Transformer plays a central role in many fundamental deep learning models, e.g., ViT in computer vision and BERT and GPT in natural language processing. The effectiveness of the Transformer is often attributed to its multi-head attention (MHA) mechanism. In this study, we discuss the limitations of MHA, including the high computational complexity due to its "query-key-value" architecture and the numerical issue caused by its softmax operation. Considering the above problems and recent trends in the design of attention layers, we propose an effective and efficient surrogate of the Transformer, called Sliceformer. Our Sliceformer replaces the classic MHA mechanism with an extremely simple "slicing-sorting" operation, i.e., projecting inputs linearly to a latent space and sorting them along different feature dimensions…
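
The "slicing-sorting" operation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `project` and `slice_sort`, the use of plain Python lists, and the choice to sort every channel in ascending order are assumptions made for illustration. The sketch projects N tokens linearly into a latent space, then sorts each latent channel (each "slice") independently across the tokens.

```python
def project(x, w):
    """Linear projection: an (N x D) input times a (D x K) weight gives (N x K)."""
    return [[sum(xi * wi for xi, wi in zip(row, col)) for col in zip(*w)]
            for row in x]


def slice_sort(x, w):
    """Project the tokens, then sort each latent channel independently.

    Assumption: every channel is sorted in ascending order. The abstract
    only says inputs are sorted "along different feature dimensions".
    """
    v = project(x, w)
    # Each column of v is one slice: the values of one channel over all tokens.
    sorted_cols = [sorted(col) for col in zip(*v)]
    # Transpose back so that rows index output positions again.
    return [list(row) for row in zip(*sorted_cols)]


# Hypothetical toy input: 3 tokens with 2 features, identity projection.
x = [[3.0, 1.0], [1.0, 2.0], [2.0, 0.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
out = slice_sort(x, w)
print(out)
```

Because each channel is sorted on its own, every output column is a permutation of the corresponding projected column, and no N x N attention matrix or softmax is ever formed.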

Code & Models

Repositories

sds-lab/sliceformer (JAX, official)

Taxonomy

Topics: Topic Modeling · Machine Learning and Data Classification · Advanced Neural Network Applications

Methods: Attention Is All You Need · Cosine Annealing · Label Smoothing · Residual Connection · Byte Pair Encoding · Weight Decay · WordPiece · Dense Connections · Linear Warmup With Cosine Annealing