Easy and Efficient Transformer: Scalable Inference Solution For Large NLP Model
Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao

TL;DR
This paper introduces Easy and Efficient Transformer (EET), a scalable inference solution that significantly reduces inference costs and improves speed for large NLP models through optimized kernels and memory management.
Contribution
EET provides algorithm-level and implementation-level optimizations, including highly optimized kernels for long inputs and large hidden sizes and a flexible CUDA memory manager, enabling faster and more memory-efficient transformer inference.
Findings
Achieves an average 1.40-4.20x speedup over Faster Transformer v4.0 on the transformer decoder layer with an A100 GPU
Reduces the memory footprint when deploying a large model
Optimized kernels target long inputs and large hidden sizes
Abstract
Recently, large-scale transformer-based models have proven effective on various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy work to reduce inference costs. To fill this gap, we introduce a scalable inference solution: Easy and Efficient Transformer (EET), which includes a series of transformer inference optimizations at the algorithm and implementation levels. First, we design highly optimized kernels for long inputs and large hidden sizes. Second, we propose a flexible CUDA memory manager to reduce the memory footprint when deploying a large model. Compared with the state-of-the-art transformer inference library (Faster Transformer v4.0), EET can achieve an average of 1.40-4.20x speedup on the transformer decoder layer with an A100 GPU.
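The abstract does not describe how the CUDA memory manager works, so the following is only a rough illustration of one general idea that such managers commonly rely on: reusing released buffers instead of allocating a new one per layer. Everything here is an assumption made for illustration. The names BufferPool and run_layers are hypothetical, bytearray stands in for device memory, and none of it is taken from the EET code.

```python
from collections import defaultdict


class BufferPool:
    """Toy stand-in for a GPU memory manager that reuses freed buffers by size.

    Hypothetical illustration only: bytearray stands in for device memory,
    and this is not EET's implementation.
    """

    def __init__(self):
        self._free = defaultdict(list)  # size in bytes -> list of idle buffers
        self.allocated_bytes = 0        # total bytes newly allocated so far

    def acquire(self, size):
        # Reuse an idle buffer of the same size when one exists.
        if self._free[size]:
            return self._free[size].pop()
        # Otherwise allocate a new buffer and count it toward the footprint.
        self.allocated_bytes += size
        return bytearray(size)

    def release(self, buf):
        # Return the buffer to the pool so a later layer can reuse it.
        self._free[len(buf)].append(buf)


def run_layers(pool, num_layers, activation_bytes):
    """Simulate layers that each need one temporary activation buffer."""
    for _ in range(num_layers):
        buf = pool.acquire(activation_bytes)
        # A real layer would launch kernels that write into buf here.
        pool.release(buf)
    return pool.allocated_bytes


footprint = run_layers(BufferPool(), num_layers=24, activation_bytes=1024)
print(footprint)  # 1024: one buffer is shared by all 24 layers
```

In this toy setting, allocating a fresh buffer per layer would request 24 x 1024 = 24576 bytes, while the pool requests 1024 bytes once and reuses them. The sketch shows only the reuse principle, not the actual savings reported for EET.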
Taxonomy
Topics: Advanced Neural Network Applications · Neural Networks and Applications · Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Cosine Annealing · Softmax · Attention Dropout · Linear Warmup With Cosine Annealing · Layer Normalization
