Accelerating Large Language Models through Partially Linear Feed-Forward Network

Gansen Hu; Zhaoguo Wang; Jinglin Wei; Wei Huang; Haibo Chen

arXiv:2501.10054·cs.LG·January 31, 2025

TL;DR

TARDIS is a novel method that accelerates large language models by approximating non-linear activations with linear functions in common input ranges, enabling significant parameter reduction and speedup with minimal accuracy loss.
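
To make the idea of a linear stand-in for an activation concrete, here is a minimal sketch that fits a line to GELU over one input range by least squares. The range `[-1, 1]`, the helper names `gelu` and `fit_linear`, and the use of a plain least-squares fit are all assumptions made for illustration; the summary above does not say how TARDIS chooses its ranges or coefficients.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU, as commonly used in transformer FFNs."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def fit_linear(lo, hi, n=1001):
    """Least-squares fit gelu(x) ~ a * x + b on the interval [lo, hi].

    Hypothetical helper: the fitting procedure is an assumption,
    not the method described in the paper.
    """
    xs = np.linspace(lo, hi, n)
    a, b = np.polyfit(xs, gelu(xs), 1)  # degree-1 fit returns (slope, intercept)
    return a, b


# Assumed "common" input range; a real system would measure it from data.
a, b = fit_linear(-1.0, 1.0)

xs = np.linspace(-1.0, 1.0, 1001)
max_err = np.max(np.abs(gelu(xs) - (a * xs + b)))
print(f"slope={a:.4f} intercept={b:.4f} max_abs_error={max_err:.4f}")
```

Narrowing the range shrinks the worst-case error of the line, which is why restricting the approximation to frequently occurring inputs and treating the rest as outliers is attractive.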

Contribution

The paper introduces TARDIS, a technique for partially linearizing activations in LLMs, allowing for effective parameter reduction and faster inference while maintaining high accuracy.

Findings

01

Achieves 80% parameter reduction in feed-forward networks.

02

Outperforms state-of-the-art pruning methods Wanda and RIA with up to 65% higher accuracy.

03

Provides a 1.6x end-to-end inference speedup when integrated with vLLM and a 1.4x speedup with the HuggingFace implementation, for a 7B model.

Abstract

Large language models (LLMs) demonstrate remarkable capabilities but face deployment challenges due to their massive parameter counts. While existing compression techniques like pruning can reduce model size, they lead to significant accuracy degradation under high compression ratios. We present a novel perspective inspired by constant folding in compiler optimization. Our approach enables parameter reduction by treating activation functions in LLMs as linear functions. However, recent LLMs use complex non-linear activations like GELU that prevent direct application of this technique. We propose TARDIS, which enables optimization of LLMs with non-linear activations by partially approximating them with linear functions in frequently occurring input ranges. For outlier inputs, TARDIS employs an online predictor to dynamically fall back to original computations. Our experiments…
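
The constant-folding idea in the abstract can be sketched as follows. If the activation is replaced by a line, `act(z) ≈ a * z + c`, then the feed-forward block `W2 @ act(W1 @ x + b1) + b2` collapses to a single affine map with weight `a * (W2 @ W1)`. Everything specific in this sketch is an assumption for illustration: the toy sizes, the coefficients `a` and `c`, the range `[LO, HI]`, and in particular the fallback, which uses an oracle that computes every pre-activation. That oracle stands in for the online predictor mentioned in the abstract and would forfeit the savings in practice; it is only here to show that correcting out-of-range neurons is algebraically simple.

```python
import numpy as np


def gelu(z):
    """Tanh approximation of GELU."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))


rng = np.random.default_rng(0)
d, h = 8, 32                        # toy model width and FFN hidden width
W1 = rng.normal(size=(h, d)) * 0.3  # up-projection
b1 = np.zeros(h)
W2 = rng.normal(size=(d, h)) * 0.3  # down-projection
b2 = np.zeros(d)

# Assumed linear approximation gelu(z) ~ a * z + c on [LO, HI] (hypothetical values).
a, c = 0.5, 0.12
LO, HI = -1.0, 1.0

# "Constant folding": precompute the product of the two weight matrices once.
W_fold = a * (W2 @ W1)                       # shape (d, d) instead of (h, d) + (d, h)
b_fold = W2 @ (a * b1 + c * np.ones(h)) + b2


def ffn_original(x):
    return W2 @ gelu(W1 @ x + b1) + b2


def ffn_folded(x):
    """Single matrix-vector product; exact only if the activation were linear."""
    return W_fold @ x + b_fold


def ffn_with_fallback(x, lo=LO, hi=HI):
    """Folded result plus an exact correction for out-of-range neurons.

    Oracle stand-in for a predictor: computes all pre-activations to find
    the neurons whose input lies outside [lo, hi].
    """
    z = W1 @ x + b1
    out = (z < lo) | (z > hi)
    y = ffn_folded(x)
    # Replace the linear estimate with the true activation for those neurons.
    y = y + W2[:, out] @ (gelu(z[out]) - (a * z[out] + c))
    return y


x = rng.normal(size=d)
print("params before:", W1.size + W2.size, "after folding:", W_fold.size)
print("folded error:  ", np.linalg.norm(ffn_folded(x) - ffn_original(x)))
print("fallback error:", np.linalg.norm(ffn_with_fallback(x) - ffn_original(x)))
```

In this toy setup the folded matrix has `d * d` entries against `2 * d * h` for the original pair, so the saving grows with the hidden-to-model width ratio. The remaining error after fallback comes only from in-range neurons, where the line is a close fit.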

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Pruning