Sparsifying Transformer Models with Trainable Representation Pooling

Michał Pietruszka; Łukasz Borchmann; Łukasz Garncarek

arXiv:2009.05169·cs.CL·March 8, 2022·1 cites

TL;DR

This paper introduces a trainable pooling method to sparsify attention in Transformer models, significantly reducing computational complexity while maintaining high performance on long document summarization tasks.

Contribution

It presents a novel trainable top-k operator for attention sparsification, enabling faster training and inference without sacrificing model quality.

Findings

1. Reduced the quadratic time and memory complexity of attention to sublinear.

2. Retained quality comparable to the current state of the art on long document summarization.

3. Trained 1.8x faster and ran inference 4.5x faster, with up to 13x better computational efficiency in the decoder.

Abstract

We propose a novel method to sparsify attention in the Transformer model by learning to select the most-informative token representations during the training process, thus focusing on the task-specific parts of an input. A reduction of quadratic time and memory complexity to sublinear was achieved due to a robust trainable top-$k$ operator. Our experiments on a challenging long document summarization task show that even our simple baseline performs comparably to the current SOTA, and with trainable pooling, we can retain its top quality, while being $1.8\times$ faster during training, $4.5\times$ faster during inference, and up to $13\times$ more computationally efficient in the decoder.
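To make the selection step concrete, here is a minimal sketch of hard top-k pooling over scored token representations. It is not the paper's trainable operator: the scores are hard-coded stand-ins for values a model would learn, and the names `top_k_pool` and `cross_attention_pairs` are hypothetical helpers introduced only for illustration. The second helper counts query-key pairs to show why any attention layer that reads the pooled sequence does proportionally less work.

```python
from typing import List, Sequence, Tuple


def top_k_pool(
    representations: Sequence[Sequence[float]],
    scores: Sequence[float],
    k: int,
) -> Tuple[List[int], List[List[float]]]:
    """Keep the k highest-scoring token vectors, preserving input order."""
    if len(representations) != len(scores):
        raise ValueError("need exactly one score per token representation")
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of tokens")
    # Rank token positions by score, highest first (ties keep the earlier token).
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Restore the original order so the pooled sequence stays left-to-right.
    kept = sorted(ranked[:k])
    return kept, [list(representations[i]) for i in kept]


def cross_attention_pairs(query_len: int, key_len: int) -> int:
    """Number of query-key score pairs computed by one attention head."""
    return query_len * key_len


# Hypothetical toy input: 6 tokens with 2-dimensional representations.
tokens = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3], [0.7, 0.8], [0.0, 0.5], [0.6, 0.1]]
scores = [0.05, 0.90, 0.10, 0.75, 0.20, 0.60]  # stand-in for learned scores

kept, pooled = top_k_pool(tokens, scores, k=3)
print(kept)    # [1, 3, 5]
print(pooled)  # [[0.9, 0.4], [0.7, 0.8], [0.6, 0.1]]

# Four queries attending to the full vs. the pooled sequence.
print(cross_attention_pairs(4, len(tokens)), cross_attention_pairs(4, len(pooled)))  # 24 12
```

A hard selection like this passes no gradient to the scores, which is presumably why the abstract stresses a trainable top-$k$ operator; the sketch only illustrates what is kept and how the attention workload shrinks with it.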

Code & Models

Repositories

applicaai/pyramidions (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Linear Warmup With Cosine Annealing · Attention Dropout · Weight Decay · Sparse Transformer · Layer Normalization