Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning

Xuanli He; Iman Keivanloo; Yi Xu; Xiang He; Belinda Zeng; Santosh Rajagopalan; Trishul Chilimbi

arXiv:2111.00230 · cs.CL · November 2, 2021 · 5 cites

TL;DR

Magic Pyramid (MP) accelerates inference in Transformer models like BERT by combining token pruning and early exiting, reducing computation significantly while maintaining high accuracy, thus enabling faster NLP applications.

Contribution

The paper introduces Magic Pyramid, a novel method that jointly employs token pruning and early exiting to accelerate Transformer inference more effectively than previous approaches.

Findings

1. Achieves up to 70% GFLOPs reduction with less than 0.5% accuracy loss.

2. Provides an average 8.06x speedup on text classification tasks.

3. Surpasses previous methods in speed-adjustable inference performance.

Abstract

Pre-training and then fine-tuning large language models is a common way to achieve state-of-the-art performance on natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed, so deploying them in applications with latency constraints is challenging. In this work, we focus on accelerating inference via conditional computation. To this end, we propose a novel idea, Magic Pyramid (MP), which reduces both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. Token pruning saves computation by removing non-salient tokens, while early exiting terminates inference before the final layer once the exiting condition is met. Our empirical studies demonstrate that compared to the previous state of the art,…
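To make the two mechanisms in the abstract concrete, here is a minimal, framework-free sketch of how width-wise token pruning and depth-wise early exiting can be combined in one forward pass. It is not the authors' implementation. The saliency score, the fixed `keep_ratio`, the entropy-based exit test, and every name in the block (`prune_tokens`, `run_with_pruning_and_exit`, `make_toy_layer`) are assumptions chosen for illustration, and each "layer" is a hypothetical stand-in for a Transformer block with an attached classifier.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def prune_tokens(tokens, saliency, keep_ratio):
    """Width-wise reduction: keep the most salient tokens, in original order."""
    k = max(1, math.ceil(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: saliency[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]


def run_with_pruning_and_exit(tokens, layers, keep_ratio, exit_threshold):
    """Run layers in order, pruning tokens between layers and exiting early.

    Each layer is a callable returning (tokens, saliency, logits). This
    interface is an assumption made for the sketch, not the paper's API.
    """
    logits, saliency = None, None
    depth = 0
    for depth, layer in enumerate(layers, start=1):
        if saliency is not None:
            # Drop non-salient tokens before the next layer sees them.
            tokens = prune_tokens(tokens, saliency, keep_ratio)
        tokens, saliency, logits = layer(tokens)
        # Depth-wise reduction: stop once the classifier is confident enough.
        if entropy(softmax(logits)) < exit_threshold:
            break
    return logits, depth, len(tokens)


def make_toy_layer(sharpness):
    """Hypothetical stand-in for a Transformer layer plus its exit classifier."""
    def layer(tokens):
        saliency = [abs(t) for t in tokens]       # toy saliency score
        logits = [sharpness * sum(tokens), 0.0]   # toy two-class logits
        return tokens, saliency, logits
    return layer


if __name__ == "__main__":
    toy_tokens = [0.1, 2.0, -0.3, 1.5]
    toy_layers = [make_toy_layer(0.1), make_toy_layer(5.0), make_toy_layer(5.0)]
    _, used_layers, kept = run_with_pruning_and_exit(toy_tokens, toy_layers, 0.5, 0.1)
    print(f"exited after {used_layers} layers with {kept} tokens")
```

In this toy run the first layer is not confident, so half of the tokens are pruned before the second layer, whose prediction falls below the entropy threshold and ends inference with the third layer unused. Raising `exit_threshold` or lowering `keep_ratio` trades accuracy for speed, which is the kind of speed-adjustable behaviour the summary above refers to.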

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Adam · Softmax