Llamas on the Web: Memory-Efficient, Performance-Portable, and Multi-Precision LLM Inference with WebGPU
Reese Levine, Rithik Sharma, Nikhil Jain, Abhijit Ramesh, Zheyuan Chen, Neha Abbas, James Contini, Tyler Sorensen

TL;DR
LlamaWeb is a WebGPU backend for llama.cpp that enables memory-efficient, performance-portable, multi-precision LLM inference in the browser across diverse hardware and quantization formats, reducing memory usage by 29-33% and increasing decode throughput by 45-69%.
Contribution
The paper introduces a memory-optimized, extensible WebGPU backend for llama.cpp that supports multiple quantization formats and handles cross-device variability through a tunable kernel library, enabling efficient browser-based LLM inference.
Findings
LlamaWeb reduces memory usage by 29-33% across devices.
It increases decode throughput by 45-69% on various GPUs.
LlamaWeb is competitive with or outperforms vendor-specific backends.
Abstract
Running language models in the browser presents a unique opportunity to build efficient, private, and portable AI applications, but requires contending with constrained memory availability and heterogeneous hardware targets. To realize this opportunity, we present Llamas on the Web (LlamaWeb), a WebGPU backend for llama.cpp that enables memory-efficient and performance-portable LLM inference across a wide range of model weight formats in the browser. Our design significantly reduces memory overhead through static memory planning and efficient model loading, addresses cross-device variability through a tunable kernel library, and introduces templated GPU kernels that support performant implementations of numerous quantization formats, enabling broad model support and extensibility to new formats. We evaluate LlamaWeb on 16 devices from 8 vendors, collecting data from 10 language…
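Illustrative sketch (not from the paper)
The abstract credits part of the memory savings to static memory planning. The Python sketch below shows the general idea only and is not LlamaWeb's actual planner: each intermediate tensor gets a fixed offset in one shared buffer before inference runs, and tensors whose lifetimes never overlap may reuse the same bytes. The `Tensor` class, the `plan_offsets` function, the greedy largest-first placement, and the example sizes are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    name: str
    size: int   # bytes needed
    first: int  # index of the first op that touches the tensor
    last: int   # index of the last op that touches the tensor


def plan_offsets(tensors):
    """Greedy static planner (hypothetical): returns ({name: offset}, total_bytes)."""
    placed = []   # (tensor, offset) pairs already assigned
    offsets = {}
    # Place large tensors first; sorted() is stable, so ties keep input order.
    for t in sorted(tensors, key=lambda x: -x.size):
        # Byte ranges held by tensors that are live at the same time as t.
        busy = sorted(
            (off, off + p.size)
            for p, off in placed
            if not (p.last < t.first or t.last < p.first)
        )
        offset = 0
        for start, end in busy:
            if offset + t.size <= start:
                break  # t fits in the gap before this busy range
            offset = max(offset, end)
        placed.append((t, offset))
        offsets[t.name] = offset
    total = max((off + p.size for p, off in placed), default=0)
    return offsets, total


# Three 100-byte tensors in a chain: a and c are never live together.
tensors = [
    Tensor("a", 100, 0, 1),
    Tensor("b", 100, 1, 2),
    Tensor("c", 100, 2, 3),
]
offsets, total = plan_offsets(tensors)
naive_total = sum(t.size for t in tensors)
print(offsets, total, naive_total)
```

In this toy case the plan needs 200 bytes instead of the 300 that separate allocations would take, because `c` reuses the bytes that `a` has released. Fixing every offset ahead of time also means no allocator has to run during inference.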
