Transkimmer: Transformer Learns to Layer-wise Skim
Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo

TL;DR
Transkimmer introduces a learnable, layer-wise token skimming mechanism for Transformers. It cuts the computation spent on tokens a layer does not need, reporting a 10.97x speedup on the GLUE benchmark with less than 1% accuracy degradation.
Contribution
It proposes an end-to-end trainable skimming method: a parameterized predictor decides which tokens to skim at each layer, and a reparameterization trick makes that discrete decision trainable, reducing Transformer computation.
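The reparameterization idea above can be illustrated with a short sketch. It assumes the Gumbel-Softmax relaxation as the reparameterization and a predictor that emits two logits per token (skim, keep). The function names, the temperature `tau`, and the 0.5 threshold are hypothetical choices for illustration and are not taken from this summary. Only the forward sampling is shown; in an autodiff framework, gradients would flow through the soft probability while the hard mask drives the actual skipping.

```python
import math
import random


def gumbel_softmax_keep_prob(logits, tau=1.0, rng=random):
    """Relaxed two-class sample for one token.

    logits: (skim_logit, keep_logit) from a hypothetical predictor.
    Returns the soft probability assigned to 'keep'.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so neither logarithm sees 0.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + g) / tau)
    # Numerically stable softmax over the two noisy logits.
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    return exps[1] / sum(exps)


def skim_mask(token_logits, tau=1.0, seed=0):
    """Return (soft keep probabilities, hard 0/1 keep mask) per token."""
    rng = random.Random(seed)
    soft = [gumbel_softmax_keep_prob(l, tau, rng) for l in token_logits]
    hard = [1 if p > 0.5 else 0 for p in soft]  # assumed threshold
    return soft, hard
```

Calling `skim_mask([(-2.0, 2.0), (1.5, -1.5)])` returns one soft probability and one hard decision per token. Lowering `tau` pushes the soft values toward 0 or 1, which is the usual trade-off between a sharper decision and a noisier gradient.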
Findings
Achieves a 10.97x speedup on the GLUE benchmark
Maintains less than 1% accuracy degradation
Effectively reduces unnecessary token processing
Abstract
The Transformer architecture has become the de facto model for many machine learning tasks, from natural language processing to computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation on every token throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they lack effective, end-to-end optimization of the discrete skimming predictor. To address these limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key…
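The forwarding behavior described in the abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `forward_with_skimming`, the `layers` callables, and the per-layer `predictors` are hypothetical names, hidden states are plain Python values, and a token that is skimmed once is assumed to stay skimmed in all later layers.

```python
def forward_with_skimming(tokens, layers, predictors):
    """Run layers over tokens, skipping tokens the predictor skims.

    tokens:     list of hidden states (any Python objects).
    layers:     list of functions mapping a list of active hidden
                states to a list of updated hidden states.
    predictors: list of functions hidden_state -> bool (True = keep),
                one per layer, applied before that layer runs.
    """
    output = list(tokens)              # final output, filled in place
    active = list(range(len(tokens)))  # positions still being processed
    for layer, keep in zip(layers, predictors):
        # Skimmed tokens keep their current hidden state as final output.
        active = [i for i in active if keep(output[i])]
        if not active:
            break  # nothing left to compute in deeper layers
        updated = layer([output[i] for i in active])
        for i, h in zip(active, updated):
            output[i] = h
    return output
```

Because each layer only receives the active subset, the work per layer shrinks as more tokens are skimmed, which is where the compute saving in this sketch comes from.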
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Handwritten Text Recognition Techniques
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Dropout · Adam · Byte Pair Encoding · Residual Connection · Label Smoothing · Position-Wise Feed-Forward Layer · Absolute Position Encodings
