TL;DR
FastFormers introduces a set of techniques combining knowledge distillation, structured pruning, and numerical optimization to significantly improve inference efficiency of Transformer models in NLP tasks, reducing costs and energy consumption.
Contribution
The paper presents practical recipes for improving Transformer inference efficiency, reporting up to a 233.9x speed-up on CPU and substantial serving-cost reductions, results that had not previously been demonstrated.
Findings
From 9.8x up to 233.9x speed-up on CPU for NLU tasks on the SuperGLUE benchmark, compared to out-of-the-box models.
Cost reduction from $4,223 to $18 for serving 100 million requests.
Energy consumption reduced by up to 125.8x.
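As a quick sanity check on the cost figures above, the short sketch below works out the implied reduction factor and the cost per million requests. It uses only the two dollar amounts and the request count quoted in the Findings; the variable names are illustrative and not taken from the paper.

```python
# Figures quoted in the Findings; variable names are illustrative assumptions.
baseline_cost_usd = 4223.0    # out-of-the-box model, 100 million requests
optimized_cost_usd = 18.0     # optimized model, same request volume
requests = 100_000_000

# Ratio of the two serving costs.
cost_ratio = baseline_cost_usd / optimized_cost_usd

# Normalize both costs to one million requests.
millions = requests / 1_000_000
baseline_per_million = baseline_cost_usd / millions
optimized_per_million = optimized_cost_usd / millions

print(f"cost reduction: {cost_ratio:.1f}x")
print(f"per million requests: ${baseline_per_million:.2f} -> ${optimized_per_million:.2f}")
```

The implied cost ratio (about 234.6x) is close to the best-case CPU speed-up, which is what one would expect if serving cost scales roughly with compute time on fixed hardware.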
Abstract
Transformer-based models are the state-of-the-art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference-time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On…
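To make the structured-pruning idea concrete, here is a minimal sketch of one common form of it: ranking attention heads by an importance score and keeping only the top few per layer. The function name, the score values, and the per-layer keep count are hypothetical, and the abstract does not specify how FastFormers computes importance, so this should be read as an illustration of the general technique rather than the paper's procedure.

```python
from typing import Dict, List


def select_heads_to_keep(
    importance: Dict[int, List[float]], keep_per_layer: int
) -> Dict[int, List[int]]:
    """Return, per layer, the indices of the highest-scoring attention heads.

    `importance` maps a layer index to one score per head (hypothetical values;
    in practice these might come from gradients or ablation experiments).
    """
    kept: Dict[int, List[int]] = {}
    for layer, scores in importance.items():
        # Rank head indices from most to least important.
        ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
        # Keep the top-k heads, reported in ascending index order.
        kept[layer] = sorted(ranked[:keep_per_layer])
    return kept


# Toy example: two layers with four heads each, keeping two heads per layer.
importance = {0: [0.9, 0.1, 0.5, 0.3], 1: [0.2, 0.8, 0.7, 0.05]}
kept = select_heads_to_keep(importance, keep_per_layer=2)
print(kept)  # {0: [0, 2], 1: [1, 2]}
```

Because whole heads are removed rather than individual weights, the remaining matrices stay dense and smaller, which is why structured pruning tends to translate into real speed-ups on standard hardware.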
Taxonomy
Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Adam · Layer Normalization · Dense Connections · Label Smoothing
