FastFormers: Highly Efficient Transformer Models for Natural Language Understanding

Young Jin Kim; Hany Hassan Awadalla

arXiv:2010.13382·cs.CL·October 27, 2020

TL;DR

FastFormers is a set of techniques that combines knowledge distillation, structured pruning, and numerical optimization to significantly improve the inference efficiency of Transformer models on NLU tasks, reducing serving cost and energy consumption.

Contribution

The paper presents practical recipes for improving Transformer inference efficiency. It reports up to a 234x (233.9x) speed-up on CPU and substantial cost reductions, a level of improvement that had not previously been demonstrated.

Findings

01 Up to 234x speed-up on CPU for NLU tasks.

02 Cost reduction from $4,223 to $18 for serving 100 million requests.

03 Energy consumption reduced by up to 125.8x.
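
As a quick sanity check on the cost finding, the arithmetic below derives the implied reduction factor and the per-million-request cost from the two dollar figures quoted above. The variable names are illustrative, and nothing here comes from the paper beyond those two figures and the request count.

```python
# Figures quoted in the findings; everything else is derived arithmetic.
baseline_cost_usd = 4223      # serving 100 million requests, before optimization
optimized_cost_usd = 18       # serving 100 million requests, after optimization
requests = 100_000_000

millions_of_requests = requests / 1_000_000

# Implied cost reduction factor.
cost_ratio = baseline_cost_usd / optimized_cost_usd

# Cost per one million requests, before and after.
cost_per_million_before = baseline_cost_usd / millions_of_requests
cost_per_million_after = optimized_cost_usd / millions_of_requests

print(f"cost reduction: {cost_ratio:.1f}x")
print(f"per million requests: ${cost_per_million_before:.2f} -> ${cost_per_million_after:.2f}")
```

The implied cost ratio is close to the headline CPU speed-up, which is what one would expect if serving cost scales roughly with compute time on the same hardware; that reading is an interpretation, not a statement from the paper.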

Abstract

Transformer-based models are the state-of-the-art for Natural Language Understanding (NLU) applications. Models are getting bigger and better on various tasks. However, Transformer models remain computationally challenging since they are not efficient at inference-time compared to traditional approaches. In this paper, we present FastFormers, a set of recipes to achieve efficient inference-time performance for Transformer-based models on various NLU tasks. We show how carefully utilizing knowledge distillation, structured pruning and numerical optimization can lead to drastic improvements on inference efficiency. We provide effective recipes that can guide practitioners to choose the best settings for various NLU tasks and pretrained models. Applying the proposed recipes to the SuperGLUE benchmark, we achieve from 9.8x up to 233.9x speed-up compared to out-of-the-box models on CPU. On…
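
To make two of the techniques named in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the temperature-scaled KL distillation loss is the standard textbook form, and the head-selection helper assumes that some per-head importance score is already available. The function names `distillation_loss` and `select_heads_to_keep`, the temperature value, and the example scores are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures (an assumption, not taken from the paper).
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


def select_heads_to_keep(head_importance, keep):
    """Structured pruning sketch: keep the `keep` highest-scoring heads.

    `head_importance` is a hypothetical list of per-head scores; how the
    scores are computed is outside the scope of this sketch.
    """
    ranked = sorted(range(len(head_importance)),
                    key=lambda i: head_importance[i], reverse=True)
    return sorted(ranked[:keep])


# Identical logits give zero loss; diverging logits give a positive loss.
same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])

# Keep the two most important of four heads (indices returned in order).
kept_heads = select_heads_to_keep([0.1, 0.9, 0.3, 0.7], keep=2)
print(same, different, kept_heads)
```

Pruning whole heads, rather than individual weights, is what makes the pruning "structured": the remaining matrices stay dense and smaller, so ordinary dense kernels run faster without sparse-matrix support.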

Taxonomy

Methods: Pruning · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Softmax · Adam · Layer Normalization · Dense Connections · Label Smoothing