Sparser, Faster, Lighter Transformer Language Models

Edoardo Cetin; Stefano Peluchetti; Emilio Castillo; Akira Naruse; Mana Murakami; Llion Jones

arXiv:2603.23198 · cs.LG · May 11, 2026

TL;DR

This paper exploits unstructured sparsity in the feedforward layers of large language models, pairing a new sparse packing format with specialized CUDA kernels to improve throughput, energy efficiency, and memory usage with minimal performance loss.

Contribution

The authors develop a new sparse packing format and CUDA kernels that enable efficient sparse computation in LLMs, demonstrating high sparsity levels with negligible impact on performance.
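
As a rough illustration of why packing nonzeros can save work, the sketch below stores only the nonzero entries of a hidden vector as parallel index and value lists, then multiplies by a weight matrix while reading only the matching weight rows. This is a generic coordinate-style packing chosen for illustration, not the authors' packing format or kernels; the names `pack_nonzeros`, `dense_matvec`, and `sparse_matvec` are hypothetical.

```python
# Generic coordinate-style packing, assumed for illustration only.
# This is NOT the packing format or the CUDA kernels from the paper.


def pack_nonzeros(vector):
    """Pack a dense vector into parallel (indices, values) lists, dropping zeros."""
    indices = [i for i, v in enumerate(vector) if v != 0.0]
    values = [vector[i] for i in indices]
    return indices, values


def dense_matvec(hidden, weight):
    """Dense reference: out[j] = sum_i hidden[i] * weight[i][j]."""
    n_out = len(weight[0])
    return [sum(h * row[j] for h, row in zip(hidden, weight)) for j in range(n_out)]


def sparse_matvec(indices, values, weight):
    """Same product, but reading only the weight rows of nonzero entries."""
    out = [0.0] * len(weight[0])
    for i, v in zip(indices, values):
        for j, w in enumerate(weight[i]):
            out[j] += v * w
    return out


# A hidden vector with 6 of 8 entries equal to zero (75% sparse).
hidden = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, -2.0, 0.0]
weight = [[float(i + j) for j in range(3)] for i in range(8)]

indices, values = pack_nonzeros(hidden)
dense_out = dense_matvec(hidden, weight)
sparse_out = sparse_matvec(indices, values, weight)
rows_read = len(indices)  # weight rows touched by the sparse path
```

Here the sparse path reads 2 of the 8 weight rows and produces the same output as the dense reference. On a real GPU the difficulty is doing this without losing the efficiency of dense execution pipelines, which is the problem the paper's format and kernels are described as addressing.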

Findings

01. Over 99% sparsity achieved with minimal performance impact
02. Significant throughput, energy, and memory benefits demonstrated
03. Open-source code and kernels released for community adoption

Abstract

Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM's feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage…
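
The abstract's "simple L1 regularization" can be sketched as an extra loss term that penalizes the mean absolute value of a feedforward layer's hidden activations. The abstract excerpt does not say exactly where the penalty is applied, so the choice of hidden activations after a ReLU, the coefficient `coeff`, and all function names below are assumptions made for illustration.

```python
# Minimal sketch of an L1 penalty on feedforward hidden activations.
# ASSUMPTIONS: the penalty targets post-ReLU hidden activations and is
# averaged over units; neither detail is stated in the abstract excerpt.


def relu(pre_activations):
    """Elementwise ReLU; non-positive inputs become exactly 0.0."""
    return [max(0.0, v) for v in pre_activations]


def l1_penalty(hidden, coeff):
    """coeff times the mean absolute activation."""
    return coeff * sum(abs(h) for h in hidden) / len(hidden)


def sparsity(hidden):
    """Fraction of entries that are exactly zero."""
    return sum(1 for h in hidden if h == 0.0) / len(hidden)


def total_loss(task_loss, hidden, coeff):
    """Task loss plus the L1 regularization term."""
    return task_loss + l1_penalty(hidden, coeff)


pre_activations = [-1.0, 0.5, -0.25, 2.0, -3.0, 0.0, -0.5, -2.0]
hidden = relu(pre_activations)

penalty = l1_penalty(hidden, coeff=0.1)
loss = total_loss(task_loss=1.0, hidden=hidden, coeff=0.1)
zero_fraction = sparsity(hidden)
```

Minimizing such a penalty pushes positive activations toward zero, and a ReLU maps any non-positive pre-activation to an exact zero, so the zero fraction measured by `sparsity` can rise during training. The coefficient trades task loss against sparsity; the value 0.1 here is arbitrary.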
