Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking

Marco Federici; Davide Belli; Mart van Baalen; Amir Jalalirad; Andrii Skliar; Bence Major; Markus Nagel; Paul Whatmough

arXiv:2412.01380·cs.LG·April 4, 2025

Open Access

TL;DR

This paper introduces Dynamic Input Pruning (DIP), a novel method for reducing memory and increasing throughput in large language model inference on mobile devices by dynamically sparsifying inputs without predictor training.

Contribution

The work presents DIP, a predictor-free dynamic sparsification technique that maintains accuracy with minimal fine-tuning and introduces cache-aware masking to enhance cache utilization during LLM inference.

Findings

01

DIP achieves 46% memory reduction and 40% throughput increase on Phi-3-Medium.

02

DIP maintains less than 0.1 perplexity loss compared to dense streaming.

03

DIP outperforms previous sparsification methods in accuracy, memory, and throughput.

Abstract

While mobile devices provide ever more compute power, improvements in DRAM bandwidth are much slower. This is unfortunate for large language model (LLM) token generation, which is heavily memory-bound. Previous work has proposed to leverage natural dynamic activation sparsity in ReLU-activated LLMs to reduce effective DRAM bandwidth per token. However, more recent LLMs use SwiGLU instead of ReLU, which results in little inherent sparsity. While SwiGLU activations can be pruned based on magnitude, the resulting sparsity patterns are difficult to predict, rendering previous approaches ineffective. To circumvent this issue, our work introduces Dynamic Input Pruning (DIP): a predictor-free dynamic sparsification approach, which preserves accuracy with minimal fine-tuning. DIP can further use lightweight LoRA adapters to regain some performance lost during sparsification. Lastly, we describe…
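The magnitude-based pruning of SwiGLU activations that the abstract mentions can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's DIP method: the function names (`swiglu_hidden`, `top_k_mask`, `pruned_down_proj`), the top-k selection rule, and the layout of the down projection as one weight vector per hidden unit are all hypothetical choices made for the example.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def swiglu_hidden(x, w_gate, w_up):
    """Hidden activations of a SwiGLU MLP: h[i] = silu(w_gate[i] . x) * (w_up[i] . x)."""
    def dot(row):
        return sum(r * v for r, v in zip(row, x))
    return [silu(dot(g)) * dot(u) for g, u in zip(w_gate, w_up)]


def top_k_mask(h, k):
    """Boolean mask that keeps the k entries of h with the largest magnitude.

    Assumption: a fixed top-k rule; a threshold on |h[i]| would also work.
    """
    order = sorted(range(len(h)), key=lambda i: abs(h[i]), reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(h))]


def pruned_down_proj(h, mask, w_down_cols):
    """Down projection that only touches the weights of kept hidden units.

    w_down_cols[i] is the output-space weight vector of hidden unit i
    (hypothetical layout). Skipped units are never read, which is where
    the reduction in memory traffic would come from.
    """
    out = [0.0] * len(w_down_cols[0])
    for i, kept in enumerate(mask):
        if not kept:
            continue
        for j, w in enumerate(w_down_cols[i]):
            out[j] += h[i] * w
    return out


# Tiny example: 2 input features, 3 hidden units, 2 output features.
x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_down_cols = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

h = swiglu_hidden(x, w_gate, w_up)
mask = top_k_mask(h, k=2)
y_sparse = pruned_down_proj(h, mask, w_down_cols)
y_dense = pruned_down_proj(h, [True] * len(h), w_down_cols)
```

Note that the mask here is computed from the hidden activations themselves, so the gate and up projections are still evaluated densely and only the down projection is pruned. The abstract's point is that such masks are hard to predict ahead of time for SwiGLU models, which is the limitation DIP is designed to sidestep.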

Taxonomy

Topics: Advanced Data Compression Techniques · Algorithms and Data Compression

Methods: SwiGLU · Pruning