KDEformer: Accelerating Transformers via Kernel Density Estimation
Amir Zandieh, Insu Han, Majid Daliri, Amin Karbasi

TL;DR
KDEformer introduces a novel approach to approximate the attention mechanism in transformers using kernel density estimation, significantly reducing computation time while maintaining accuracy.
Contribution
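The paper presents KDEformer, a method that accelerates transformer attention by reducing the computation of softmax partition functions to a KDE problem. It approximates attention in sub-quadratic time with provable spectral norm bounds, whereas prior results provide only entry-wise error bounds.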
Findings
Achieves over 4x speedup in BigGAN image generation.
Provides over 18x speedup in ImageNet classification with minimal accuracy loss.
Outperforms existing attention approximation methods in accuracy, memory, and runtime.
Abstract
Dot-product attention mechanism plays a crucial role in modern deep architectures (e.g., Transformer) for sequence modeling; however, naive exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds.…
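The reduction mentioned in the abstract can be illustrated with a small sketch. It relies on the identity exp(<q, k>) = exp(|q|^2/2) * exp(|k|^2/2) * exp(-|q - k|^2/2), which rewrites a softmax denominator as a weighted Gaussian kernel sum over the keys. The function names, the problem sizes, and the uniform-subsampling estimator are assumptions made for illustration. In particular, the uniform estimator is only a naive stand-in for an efficient KDE solver and is not the paper's algorithm.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def partition_exact(q, keys):
    """Softmax denominator for one query: sum_j exp(<q, k_j>)."""
    return sum(math.exp(dot(q, k)) for k in keys)

def partition_via_gaussian_kde(q, keys):
    """Same value written as a weighted Gaussian kernel sum.

    Uses exp(<q, k>) = exp(|q|^2/2) * exp(|k|^2/2) * exp(-|q - k|^2/2).
    """
    kde_sum = 0.0
    for k in keys:
        weight = math.exp(dot(k, k) / 2.0)            # per-key weight
        sq_dist = sum((a - b) ** 2 for a, b in zip(q, k))
        kde_sum += weight * math.exp(-sq_dist / 2.0)  # Gaussian kernel term
    return math.exp(dot(q, q) / 2.0) * kde_sum

def partition_subsampled(q, keys, m, rng):
    """Uniform-subsample estimate (hypothetical stand-in, not the paper's solver)."""
    sample = rng.sample(keys, m)
    return (len(keys) / m) * sum(math.exp(dot(q, k)) for k in sample)

# Toy data: sizes and scaling are arbitrary choices for the illustration.
rng = random.Random(0)
n, d = 256, 16
scale = 1.0 / math.sqrt(d)
keys = [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(n)]
q = [rng.gauss(0.0, scale) for _ in range(d)]

exact = partition_exact(q, keys)
via_kde = partition_via_gaussian_kde(q, keys)
approx = partition_subsampled(q, keys, 64, rng)
print(exact, via_kde, approx)
```

The first two values agree up to floating-point rounding, which is the sense in which the partition function is a KDE query. The third touches only 64 of the 256 keys, so it is cheaper but carries sampling error; uniform sampling comes with no guarantee of the kind the abstract describes.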
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Medical Image Segmentation Techniques
Methods: Attention Is All You Need · BigGAN · Non-Local Operation · Batch Normalization · Feedforward Network · 1x1 Convolution · GAN Hinge Loss · Adam · Projection Discriminator
