AdapLeR: Speeding up Inference by Adaptive Length Reduction

Ali Modarressi; Hosein Mohebbi; Mohammad Taher Pilehvar

arXiv:2203.08991 · cs.CL · March 18, 2022

TL;DR

AdapLeR speeds up inference in pre-trained language models by dynamically dropping less important tokens from layer to layer, which shortens the sequence with minimal performance loss. A contribution predictor trained for each layer identifies which tokens to drop.

Contribution

This work introduces a novel adaptive token elimination method for BERT that reduces inference time without substantial accuracy degradation.

Findings

01. Achieves up to 22x speedup during inference

02. Maintains high performance on diverse classification tasks

03. Lower false positive rate in token importance detection

Abstract

Pre-trained language models have shown stellar performance in various downstream tasks. However, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups up to 22x during inference time without much sacrifice in performance. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In…
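The layer-wise elimination described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear scorer standing in for the Contribution Predictor, the threshold rule (a fraction of the uniform score 1/n), the always-kept first token, and the names `prune_tokens`, `run_layers`, and `scale` are all assumptions made for illustration. The encoder layers themselves are omitted.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def prune_tokens(hidden, weights, scale=1.0, protected=(0,)):
    """Drop tokens whose predicted contribution falls below a threshold.

    hidden    : list of token vectors (one list of floats per token)
    weights   : hypothetical linear scorer standing in for a trained
                per-layer contribution predictor (assumption)
    scale     : threshold as a fraction of the uniform score 1/n (assumption)
    protected : token positions that are never dropped, e.g. a
                classification token at position 0 (assumption)
    """
    raw = [sum(h * w for h, w in zip(vec, weights)) for vec in hidden]
    scores = softmax(raw)
    threshold = scale / len(hidden)
    kept = [i for i, s in enumerate(scores) if s >= threshold or i in protected]
    return [hidden[i] for i in kept], kept


def run_layers(hidden, layer_weights, scale=1.0):
    """Apply pruning once per layer and record the sequence length."""
    lengths = [len(hidden)]
    for weights in layer_weights:
        # A real encoder layer would transform `hidden` here; omitted in this sketch.
        hidden, _ = prune_tokens(hidden, weights, scale)
        lengths.append(len(hidden))
    return hidden, lengths


if __name__ == "__main__":
    tokens = [[0.0], [0.0], [5.0], [0.0]]
    _, lengths = run_layers(tokens, [[1.0], [1.0]])
    print(lengths)
```

Because every later layer only processes the surviving tokens, the sequence can only shrink with depth, which is the source of the reduced computation. How aggressively it shrinks depends on the threshold, which is a tunable assumption in this sketch.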

Code & Models

Repositories

amodaresi/adapler (TensorFlow, official)

Taxonomy

Methods

Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · WordPiece · Dropout