Transkimmer: Transformer Learns to Layer-wise Skim

Yue Guan; Zhengyi Li; Jingwen Leng; Zhouhan Lin; Minyi Guo

arXiv:2205.07324·cs.CL·May 17, 2022

TL;DR

Transkimmer adds a learnable, layer-wise token skimming mechanism to Transformers. Each layer processes only the tokens it needs, giving a reported 10.97x speedup on GLUE with less than 1% accuracy degradation.

Contribution

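The paper proposes an end-to-end trainable skimming method for efficient Transformer computation. A parameterized predictor makes the per-layer skimming decision, and a reparameterization trick lets that discrete decision be optimized together with the rest of the model.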
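
To make the role of the reparameterization trick concrete, here is a minimal sketch in plain Python. It assumes a Gumbel-softmax relaxation, one common reparameterization for discrete choices; this summary does not say which form the paper uses. The logits, the temperature `tau`, and all function names are hypothetical.

```python
import math
import random


def gumbel_noise(rng):
    """Sample Gumbel(0, 1) noise via the inverse CDF."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def relaxed_keep_probability(logit_skim, logit_keep, tau, rng):
    """Soft 'keep' weight from a two-way Gumbel-softmax sample.

    logit_skim and logit_keep stand in for the outputs of a hypothetical
    per-layer predictor; tau is the softmax temperature (assumed).
    """
    a = (logit_skim + gumbel_noise(rng)) / tau
    b = (logit_keep + gumbel_noise(rng)) / tau
    m = max(a, b)  # subtract the max for numerical stability
    ea, eb = math.exp(a - m), math.exp(b - m)
    return eb / (ea + eb)


def hard_mask(soft_keep):
    """Discretize for the forward pass: 1 keeps the token, 0 skims it."""
    return 1 if soft_keep > 0.5 else 0


rng = random.Random(0)
logits = [(2.0, -1.0), (-0.5, 1.5), (0.1, 0.0)]  # made-up (skim, keep) logits
soft = [relaxed_keep_probability(s, k, tau=0.5, rng=rng) for s, k in logits]
mask = [hard_mask(p) for p in soft]
```

In a framework with automatic differentiation, the hard mask would typically drive the forward pass while gradients flow through the soft weight. This sketch shows only the sampling step.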

Findings

01 Achieves a 10.97x average speedup on the GLUE benchmark over a vanilla BERT-base baseline

02 Keeps accuracy degradation below 1%

03 Skips computation for tokens that a layer does not need

Abstract

The Transformer architecture has become the de facto model for many machine learning tasks in natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack effective end-to-end optimization of the discrete skimming predictor. To address these limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key…
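
The forwarding behavior described in the abstract can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: token states are single floats, `layers` and `predictors` are hypothetical stand-ins for Transformer layers and learned skim predictors, and a token that has been skimmed is assumed to stay skimmed in later layers.

```python
def run_with_skimming(hidden, layers, predictors):
    """Run layers over a shrinking set of active tokens.

    hidden:     list of per-token states (floats here, vectors in practice)
    layers:     list of functions mapping active states to new states
    predictors: list of functions mapping one state to True (keep) or
                False (skim); hypothetical stand-ins for learned predictors
    """
    final = list(hidden)               # output slot for every position
    active = list(range(len(hidden)))  # positions still being processed

    for layer, keep in zip(layers, predictors):
        # Decide, before the layer runs, which active tokens it needs.
        active = [i for i in active if keep(final[i])]
        if not active:
            break
        updated = layer([final[i] for i in active])
        for i, h in zip(active, updated):
            final[i] = h
    # Skimmed positions keep their last state, i.e. they are forwarded
    # unchanged to the final output.
    return final, active


def add_one(states):
    """Toy 'layer': adds 1 to every active state."""
    return [s + 1 for s in states]


hidden = [1.0, -2.0, 3.0, 0.5]
layers = [add_one, add_one]
predictors = [lambda h: abs(h) >= 1.0, lambda h: h >= 3.0]  # made-up rules
final, active = run_with_skimming(hidden, layers, predictors)
```

In this toy run the last token is skimmed before the first layer and two more are skimmed before the second, so the second layer processes a single token while the others pass through with the state they had when skimmed.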

Code & Models

Repositories

chandlerguan/transkimmer (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques

Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dropout · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings