TL;DR
Vec-LUT introduces a vectorized lookup paradigm for ultra-low-bit LLM inference on edge devices, improving memory bandwidth utilization during parallel inference and outperforming state-of-the-art baselines by up to 4.2x.
Contribution
The paper proposes the vector LUT approach, together with a new tensor layout and cache-aware techniques, to make parallel ultra-low-bit LLM inference more efficient.
Findings
Vec-LUT outperforms state-of-the-art baselines by up to 4.2x.
Implemented in llama.cpp and tested on 5 edge devices with 3 LLMs.
Reduces memory bandwidth underutilization in LUT-based inference.
Abstract
Large language models (LLMs) are increasingly deployed on edge devices. To meet strict resource constraints, real-world deployment has pushed LLM quantization from 8-bit to 4-bit, 2-bit, and now 1.58-bit. Combined with lookup table (LUT)-based inference, CPUs run these ultra-low-bit LLMs even faster than NPUs, opening new opportunities for ubiquitous on-device intelligence. However, this paper identifies that LUT-based inference underutilizes memory bandwidth during parallel inference, which is required for prefilling, test-time scaling, and other multi-token scenarios. The root cause is the scalar LUT paradigm, which performs repetitive and non-contiguous memory accesses for each token. To solve the issue, we propose vector LUT, a new lookup paradigm that constructs a unified LUT across parallel tokens, and performs a single lookup per index. To realize it…
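The contrast between the scalar and vector lookup paradigms can be illustrated with a toy example. The sketch below is not the paper's kernel (the abstract is cut off before the implementation details) and it ignores the tensor layout and cache-aware techniques entirely. It assumes ternary weights in {-1, 0, 1}, which is what 1.58-bit usually refers to (log2 3 ≈ 1.585), and a hypothetical group size `G` of two weights per lookup index. The names `encode_weights`, `scalar_lut_matmul`, and `vector_lut_matmul` are made up for illustration. In the scalar version, each token gets its own table, so every weight index is looked up once per token. In the vector version, one unified table stores a vector over all tokens per entry, so each weight index is looked up once and the fetched values are contiguous across tokens.

```python
from itertools import product

G = 2  # weights per lookup index (hypothetical group size, not from the paper)
PATTERNS = list(product((-1, 0, 1), repeat=G))  # all 3**G ternary patterns
PATTERN_ID = {p: i for i, p in enumerate(PATTERNS)}


def encode_weights(w):
    """Pack each row of ternary weights into one LUT index per group of G."""
    return [[PATTERN_ID[tuple(row[k:k + G])] for k in range(0, len(row), G)]
            for row in w]


def scalar_lut_matmul(w_idx, x):
    """Scalar paradigm: one table per token, one lookup per (index, token)."""
    n_tokens, k = len(x), len(x[0])
    out = [[0.0] * n_tokens for _ in w_idx]
    for t in range(n_tokens):
        # table[block][p] = dot(PATTERNS[p], activations of token t in block)
        table = [[sum(c * a for c, a in zip(p, x[t][off:off + G]))
                  for p in PATTERNS]
                 for off in range(0, k, G)]
        for m, row in enumerate(w_idx):
            out[m][t] = sum(table[b][i] for b, i in enumerate(row))
    return out


def vector_lut_matmul(w_idx, x):
    """Vector paradigm: one unified table, one lookup per index for all tokens."""
    n_tokens, k = len(x), len(x[0])
    # table[block][p] is a list of n_tokens partial sums, one per token
    table = [[[sum(c * a for c, a in zip(p, x[t][off:off + G]))
               for t in range(n_tokens)]
              for p in PATTERNS]
             for off in range(0, k, G)]
    out = []
    for row in w_idx:
        acc = [0.0] * n_tokens
        for b, i in enumerate(row):
            vec = table[b][i]  # a single lookup serves every token
            acc = [s + v for s, v in zip(acc, vec)]
        out.append(acc)
    return out
```

Both functions take the packed indices from `encode_weights(w)` and an activation list `x` with one row per token, and both return an output with one row per weight row and one column per token. They compute the same product; the difference is only in how many lookups are issued and how the looked-up values are laid out in memory.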
