Sparsifying Transformer Models with Trainable Representation Pooling
Michał Pietruszka, Łukasz Borchmann, Łukasz Garncarek

TL;DR
This paper introduces a trainable pooling method to sparsify attention in Transformer models, significantly reducing computational complexity while maintaining high performance on long document summarization tasks.
Contribution
It presents a novel trainable top-k operator for attention sparsification, enabling faster training and inference without sacrificing model quality.
Findings
Achieved sublinear complexity in attention computation.
Maintained state-of-the-art performance on long document summarization.
Reduced training time by 1.8x and inference time by 4.5x.
Abstract
We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-k operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality, while being 1.8x faster during training, 4.5x faster during inference, and more computationally efficient in the decoder.
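To make the idea of selecting token representations concrete, here is a minimal sketch of score-based top-k pooling. It is not the paper's operator: the abstract describes a trainable top-k, whereas this sketch uses a hard, non-differentiable selection with a fixed linear scorer. The names `score_tokens` and `top_k_pool`, the random weights, and the sizes `n`, `d`, `k` are all assumptions made for illustration.

```python
import random


def score_tokens(tokens, weights, bias=0.0):
    """Assign one scalar score per token vector with a linear scorer.

    Hypothetical stand-in for a learned scoring layer.
    """
    return [sum(w * x for w, x in zip(weights, tok)) + bias for tok in tokens]


def top_k_pool(tokens, scores, k):
    """Keep the k highest-scoring token vectors, preserving input order.

    This is a hard selection, so no gradient flows through the choice;
    a trainable operator would need a differentiable relaxation.
    """
    if not 0 < k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # restore original token order
    return keep, [tokens[i] for i in keep]


random.seed(0)
n, d, k = 16, 4, 4  # illustrative sizes, not taken from the paper
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
weights = [random.gauss(0, 1) for _ in range(d)]

scores = score_tokens(tokens, weights)
kept_idx, pooled = top_k_pool(tokens, scores, k)

# Pairwise attention scores needed: n * n before pooling, k * k after.
print(kept_idx, n * n, k * k)
```

Any layer that attends over `pooled` instead of `tokens` works with k representations rather than n, which is where the savings in such a scheme would come from.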
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Weight Decay · Sparse Transformer · Layer Normalization
