CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference
Dong Liu, Yanxuan Yu, Ben Lengerich

TL;DR
CSV-Decode uses geometric upper bounds to restrict each decoding step of a large language model to a small sub-vocabulary, speeding up the output layer computation while preserving correctness guarantees.
Contribution
The paper proposes CSV-Decode, a novel technique that constructs small, certifiable sub-vocabularies for efficient large language model inference using geometric bounds and clustering.
Findings
Achieves significant speedup over full vocabulary decoding
Maintains dual correctness guarantees: exact top-k and epsilon-certified softmax
Keeps fallback rates and approximation errors low
Abstract
Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-k certification and ε-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation…
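The centroid-plus-radius idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`build_clusters`, `certified_top_k`) are hypothetical, the clustering is a single nearest-seed assignment pass chosen for brevity, and only the exact top-k guarantee is covered, not the ε-certified softmax. The sketch assumes logits are plain dot products between the hidden state h and the output embeddings. Under that assumption, the Cauchy-Schwarz inequality gives, for every token w in a cluster with centroid mu and radius r, the bound <w, h> <= <mu, h> + r * ||h||, so a whole cluster can be skipped once its bound falls below the current k-th best exact logit.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_clusters(W, n_clusters, seed=0):
    """Offline step (simplified assumption): assign each embedding to the
    nearest of n_clusters randomly chosen seeds, then store each cluster's
    mean (centroid), its radius (max distance to the centroid), and members."""
    rng = random.Random(seed)
    seeds = rng.sample(W, n_clusters)
    members = [[] for _ in range(n_clusters)]
    for i, w in enumerate(W):
        c = min(range(n_clusters), key=lambda j: math.dist(w, seeds[j]))
        members[c].append(i)
    clusters = []
    for ids in members:
        if not ids:  # skip empty clusters (possible only with duplicate seeds)
            continue
        dim = len(W[0])
        mu = [sum(W[i][d] for i in ids) / len(ids) for d in range(dim)]
        radius = max(math.dist(W[i], mu) for i in ids)
        clusters.append((mu, radius, ids))
    return clusters


def certified_top_k(W, h, k, clusters):
    """Online step: score clusters in order of decreasing upper bound and stop
    once the k-th best exact logit is at least the best remaining bound."""
    h_norm = math.sqrt(dot(h, h))
    # Upper bound on any logit inside a cluster: <mu, h> + radius * ||h||.
    bounded = sorted(
        ((dot(mu, h) + r * h_norm, ids) for mu, r, ids in clusters),
        key=lambda t: t[0],
        reverse=True,
    )
    scored = []  # (exact logit, token id) for every token scored so far
    for pos, (_, ids) in enumerate(bounded):
        scored.extend((dot(W[i], h), i) for i in ids)
        scored.sort(reverse=True)
        remaining = bounded[pos + 1][0] if pos + 1 < len(bounded) else -math.inf
        if len(scored) >= k and scored[k - 1][0] >= remaining:
            break  # no unscored token can enter the top-k
    return [i for _, i in scored[:k]], len(scored)


# Toy usage on random data (illustrative only, not the paper's setup).
rng = random.Random(1)
W = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
h = [rng.gauss(0, 1) for _ in range(8)]
clusters = build_clusters(W, 16)
top, n_scored = certified_top_k(W, h, 5, clusters)
print(top, "tokens scored:", n_scored, "of", len(W))
```

On isotropic random vectors like these the radii are large relative to the centroid scores, so the bounds are loose and many clusters may still be scored; the toy data says nothing about the speedups reported in the paper. The point of the sketch is only that the returned top-k matches full-vocabulary decoding whenever the stopping test fires.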
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
