FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference
Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, and Xia Hu

TL;DR
FFSplit is a novel method that splits the feed-forward network in language models based on neuron activation patterns, significantly improving efficiency while maintaining accuracy, enabling deployment on resource-limited hardware.
Contribution
The paper introduces a neuron-splitting technique for FFNs that enhances the accuracy-efficiency trade-off in language model inference.
Findings
Reduces model size by 43.1%.
Achieves a 1.25× to 1.56× speedup in inference.
Maintains negligible accuracy loss.
Abstract
The large number of parameters in pretrained language models enhances their performance, but also makes them resource-intensive and therefore challenging to deploy on commodity hardware such as a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model's size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge is the feed-forward network (FFN) component, which accounts for roughly 2/3 of the total parameters and inference latency. In this paper, we first observe that only a few neurons of the FFN module have a large output norm for any input token, a.k.a. heavy hitters, while the others are sparsely triggered…
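The heavy-hitter observation in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the ReLU activation, the scoring rule (mean per-token norm of each neuron's output contribution), the random weights, and all function names (`neuron_scores`, `split_indices`, `partial_ffn`) are assumptions made for illustration. It only shows that an FFN can be partitioned by hidden neuron into a "heavy" part and a "rest" part whose outputs sum to the original output; how FFSplit then treats each part is not covered in the text above.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ffn(x, w1, b1, w2, b2):
    """Two-layer FFN (ReLU assumed): relu(x @ w1 + b1) @ w2 + b2."""
    return relu(x @ w1 + b1) @ w2 + b2

def neuron_scores(x, w1, b1, w2):
    """Hypothetical score: mean over tokens of each neuron's output norm.

    Neuron i contributes h[:, i] * w2[i, :] to the output, so its norm
    for one token is |h[t, i]| * ||w2[i, :]||.
    """
    h = relu(x @ w1 + b1)                      # (tokens, d_ff)
    row_norms = np.linalg.norm(w2, axis=1)     # (d_ff,)
    return (np.abs(h) * row_norms).mean(axis=0)

def split_indices(scores, k):
    """Return (heavy, rest): the k highest-scoring neurons and the others."""
    order = np.argsort(scores)[::-1]
    return np.sort(order[:k]), np.sort(order[k:])

def partial_ffn(x, w1, b1, w2, ids):
    """Output contribution of the hidden neurons in ids (no output bias)."""
    return relu(x @ w1[:, ids] + b1[ids]) @ w2[ids, :]

# Toy sizes and random weights, chosen only for the demonstration.
rng = np.random.default_rng(0)
d_model, d_ff, n_tokens, k = 8, 32, 16, 4
x = rng.normal(size=(n_tokens, d_model))
w1 = rng.normal(size=(d_model, d_ff))
b1 = rng.normal(size=d_ff)
w2 = rng.normal(size=(d_ff, d_model))
b2 = rng.normal(size=d_model)

scores = neuron_scores(x, w1, b1, w2)
heavy, rest = split_indices(scores, k)

y_full = ffn(x, w1, b1, w2, b2)
y_split = partial_ffn(x, w1, b1, w2, heavy) + partial_ffn(x, w1, b1, w2, rest) + b2
print(np.allclose(y_full, y_split))
```

Because the activation is applied elementwise and the second matrix product sums over hidden neurons, the two partial outputs add up to the full FFN output up to floating-point rounding. Splitting by itself therefore changes nothing; any size or latency gain would have to come from handling the two parts differently.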
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Neural Network Applications
