Magic Pyramid: Accelerating Inference with Early Exiting and Token Pruning
Xuanli He, Iman Keivanloo, Yi Xu, Xiang He, Belinda Zeng, Santosh Rajagopalan, Trishul Chilimbi

TL;DR
Magic Pyramid (MP) accelerates inference in Transformer models like BERT by combining token pruning and early exiting, reducing computation significantly while maintaining high accuracy, thus enabling faster NLP applications.
Contribution
The paper introduces Magic Pyramid, a novel method that jointly employs token pruning and early exiting to accelerate Transformer inference more effectively than previous approaches.
Findings
Achieves up to 70% GFLOPs reduction with less than 0.5% accuracy loss.
Provides an average of 8.06x speedup on text classification tasks.
Supports speed-adjustable inference and outperforms token pruning or early exiting used alone.
Abstract
Pre-training and then fine-tuning large language models is commonly used to achieve state-of-the-art performance in natural language processing (NLP) tasks. However, most pre-trained models suffer from low inference speed. Deploying such large models to applications with latency constraints is challenging. In this work, we focus on accelerating the inference via conditional computations. To achieve this, we propose a novel idea, Magic Pyramid (MP), to reduce both width-wise and depth-wise computation via token pruning and early exiting for Transformer-based models, particularly BERT. The former manages to save the computation via removing non-salient tokens, while the latter can fulfill the computation reduction by terminating the inference early before reaching the final layer, if the exiting condition is met. Our empirical studies demonstrate that compared to previous state of arts,…
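The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the saliency score, the entropy-based exit test, the fixed keep ratio, and every name below (`prune_tokens`, `run_with_pruning_and_exit`, `make_toy_layer`) are assumptions made for illustration, and the "layers" are simple stand-ins rather than Transformer blocks. It only shows how width-wise reduction (dropping non-salient tokens) and depth-wise reduction (stopping before the final layer) can be combined in one loop.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (natural log); low entropy means a confident prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def prune_tokens(tokens, saliency, keep_ratio):
    """Width-wise reduction: keep the most salient tokens, preserving order.

    Position 0 is always kept (assumed to play the role of a [CLS] token).
    """
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    ranked = sorted(range(1, len(tokens)), key=lambda i: saliency[i], reverse=True)
    keep = sorted([0] + ranked[: n_keep - 1])
    return [tokens[i] for i in keep]


def run_with_pruning_and_exit(tokens, layers, keep_ratio, entropy_threshold):
    """Depth-wise reduction: stop at the first layer whose prediction is confident.

    `layers` is a non-empty list of (encode, score, classify) callables; these
    are hypothetical stand-ins, not a real model API.
    Returns (probs, exit_depth, token_steps), where token_steps counts the
    tokens processed across layers as a crude proxy for compute.
    """
    token_steps = 0
    for depth, (encode, score, classify) in enumerate(layers, start=1):
        tokens = encode(tokens)
        token_steps += len(tokens)
        probs = softmax(classify(tokens))
        if entropy(probs) < entropy_threshold:
            break  # exit condition met: skip the remaining layers
        tokens = prune_tokens(tokens, score(tokens), keep_ratio)
    return probs, depth, token_steps


def make_toy_layer(sharpness):
    """Toy layer whose classifier grows more confident as `sharpness` rises."""
    encode = lambda toks: list(toks)
    score = lambda toks: [abs(t) for t in toks]
    classify = lambda toks: [sharpness * sum(toks), 0.0]
    return encode, score, classify


layers = [make_toy_layer(s) for s in (0.1, 0.5, 2.0, 4.0)]
tokens = [1.0, 0.2, -0.1, 0.9, 0.05, 0.8]
probs, exit_depth, token_steps = run_with_pruning_and_exit(
    tokens, layers, keep_ratio=0.5, entropy_threshold=0.3
)
full_cost = len(tokens) * len(layers)  # cost with no pruning and no early exit
print(exit_depth, token_steps, full_cost)
```

In this toy run the sequence shrinks at each layer and the loop stops before the last layer, so `token_steps` is smaller than `full_cost`. Raising `entropy_threshold` or lowering `keep_ratio` trades accuracy for speed, which is one way to picture a speed-adjustable setting.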
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Pruning · Linear Layer · Residual Connection · Layer Normalization · Dense Connections · Adam · Softmax
