Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model

Gongzheng Li; Yadong Xi; Jingzhen Ding; Duan Wang; Bai Liu; Changjie Fan; Xiaoxi Mao; Zeng Zhao

arXiv:2104.12470 · cs.CL · May 25, 2022

TL;DR

This paper introduces Easy and Efficient Transformer (EET), a scalable inference solution that significantly reduces inference costs and improves speed for large NLP models through optimized kernels and memory management.

Contribution

EET provides novel algorithm-level and implementation-level optimizations, including highly optimized kernels and a flexible CUDA memory manager, enabling faster and more memory-efficient transformer inference.

Findings

01. Achieves an average 1.40-4.20x speedup over Faster Transformer v4.0 on the transformer decoder layer (A100 GPU)

02. Reduces the memory footprint when deploying a large model

03. Kernels are optimized for long inputs and large hidden sizes

Abstract

Recently, large-scale transformer-based models have proven effective on various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy work to reduce inference costs. To fill this gap, we introduce a scalable inference solution: Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET can achieve an average of 1.40-4.20x speedup on the transformer decoder layer with an A100 GPU.
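The abstract does not describe how the CUDA memory manager works. As a rough illustration only, one common way to reduce footprint is to reuse scratch buffers across layers, since a layer's intermediate results are no longer needed once the next layer starts. The sketch below simulates that idea in plain Python with `bytearray` standing in for device memory; `BufferPool`, `acquire`, `release`, and `run_layers` are hypothetical names and are not taken from EET.

```python
from collections import defaultdict


class BufferPool:
    """Toy size-keyed buffer pool (hypothetical; not EET's actual manager)."""

    def __init__(self):
        self._free = defaultdict(list)  # size in bytes -> idle buffers
        self.allocated_bytes = 0        # bytes obtained through fresh allocations

    def acquire(self, nbytes):
        # Reuse an idle buffer of the same size when one exists.
        if self._free[nbytes]:
            return self._free[nbytes].pop()
        self.allocated_bytes += nbytes
        return bytearray(nbytes)

    def release(self, buf):
        # Return the buffer to the pool instead of freeing it.
        self._free[len(buf)].append(buf)


def run_layers(pool, num_layers, activation_bytes):
    """Simulate layers that each need one scratch buffer of the same size."""
    for _ in range(num_layers):
        scratch = pool.acquire(activation_bytes)
        # ... a real layer would write its intermediate results here ...
        pool.release(scratch)
    return pool.allocated_bytes


pool = BufferPool()
total = run_layers(pool, num_layers=24, activation_bytes=1024)
print(total)  # 1024: one buffer serves all 24 layers
```

Without the pool, the same loop would allocate 24 separate buffers; with it, the footprint stays at the size of a single layer's scratch space. A real GPU manager would also need to account for stream ordering and differing buffer sizes, which this sketch ignores.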

Code & Models

Repositories

NetEase-FuXi/EET (PyTorch, official)

Taxonomy

Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and Data Classification

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Softmax · Attention Dropout · Linear Warmup With Cosine Annealing · Layer Normalization