Accelerating Large Language Models through Partially Linear Feed-Forward Network
Gansen Hu, Zhaoguo Wang, Jinglin Wei, Wei Huang, Haibo Chen

TL;DR
TARDIS is a novel method that accelerates large language models by approximating non-linear activations with linear functions in common input ranges, enabling significant parameter reduction and speedup with minimal accuracy loss.
Contribution
The paper introduces TARDIS, a technique for partially linearizing activations in LLMs, allowing for effective parameter reduction and faster inference while maintaining high accuracy.
Findings
Achieves 80% parameter reduction in feed-forward networks.
Outperforms state-of-the-art pruning methods Wanda and RIA with up to 65% higher accuracy.
Provides 1.6x and 1.4x inference speedup in practical deployments.
Abstract
Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While existing compression techniques like pruning can reduce model size, they lead to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions. However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose TARDIS, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, TARDIS employs an online predictor to dynamically fall back to original computations. Our experiments…
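The constant-folding idea in the abstract can be illustrated with a minimal sketch. It assumes a bias-free two-layer feed-forward block y = W2 · act(W1 · x) and a single scalar linear approximation act(h) ≈ a·h + b shared by all hidden units; the function names (fold, folded_ffn, param_count), the toy shapes, and the values of a and b are hypothetical and are not taken from the paper. Under that assumption the two weight matrices collapse into one matrix plus a constant vector, which is where the parameter reduction comes from. The outlier predictor and fallback path are not modeled here.

```python
import math
import random


def gelu(x):
    # Exact GELU using the Gaussian CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(m, v):
    # Matrix-vector product for a list-of-rows matrix.
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def matmul(a, b):
    # Matrix-matrix product for list-of-rows matrices.
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def ffn(w1, w2, x):
    # Original block: W2 @ gelu(W1 @ x), two matrices kept separately.
    return matvec(w2, [gelu(h) for h in matvec(w1, x)])


def fold(w1, w2, a, b):
    # Assumption: act(h) ~= a*h + b for every hidden unit. Then
    # W2 @ (a * (W1 @ x) + b) == (a * W2 @ W1) @ x + b * (W2 @ ones).
    w = [[a * v for v in row] for row in matmul(w2, w1)]
    c = [b * sum(row) for row in w2]
    return w, c


def folded_ffn(w, c, x):
    # Folded block: one small matrix plus a constant vector.
    return [y + ci for y, ci in zip(matvec(w, x), c)]


def param_count(m):
    return sum(len(row) for row in m)


if __name__ == "__main__":
    random.seed(0)
    d, h = 4, 16  # toy sizes, chosen only for illustration
    w1 = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(h)]
    w2 = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(d)]
    x = [random.uniform(-1, 1) for _ in range(d)]
    w, c = fold(w1, w2, a=0.5, b=0.1)  # illustrative slope and intercept
    print("original FFN:", ffn(w1, w2, x))
    print("folded FFN:  ", folded_ffn(w, c, x))
    print("parameters:", param_count(w1) + param_count(w2), "->", param_count(w) + len(c))
```

The folded output matches the linearized block exactly, but it matches the original GELU block only as well as the linear fit holds over the hidden values actually seen. The parameter counts printed for these toy shapes are not the paper's reported figures.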
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis
Methods: Pruning
