CSV-Decode: Certifiable Sub-Vocabulary Decoding for Efficient Large Language Model Inference

Dong Liu; Yanxuan Yu; Ben Lengerich

arXiv:2511.21702 · cs.CL · December 1, 2025

TL;DR

CSV-Decode uses geometric upper bounds to restrict each decoding step of a large language model to a small sub-vocabulary, which speeds up the output-layer computation while retaining correctness guarantees (exact top-$k$ and $ε$-certified softmax).

Contribution

The paper proposes CSV-Decode, a novel technique that constructs small, certifiable sub-vocabularies for efficient large language model inference using geometric bounds and clustering.

Findings

01 Achieves significant speedup over full-vocabulary decoding

02 Maintains dual correctness guarantees: exact top-$k$ and $ε$-certified softmax

03 Keeps fallback rates and approximation error low
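
To make the two guarantees concrete, here is one way the bounds could be written. The notation is assumed for illustration and is not taken from this listing: $h$ is the hidden state, $w_i$ the output embedding of token $i$, $\mu_c$ and $r_c$ the centroid and radius of cluster $c$, $n_c$ its size, and $S$ the set of tokens whose logits were actually computed. The paper's exact certification criteria may differ.

For every token $i$ in cluster $c$, the Cauchy-Schwarz inequality gives

$$ \langle w_i, h \rangle = \langle \mu_c, h \rangle + \langle w_i - \mu_c, h \rangle \le \langle \mu_c, h \rangle + r_c \|h\|_2 =: U_c(h). $$

Top-$k$ is exact whenever the $k$-th largest computed logit $\tau_k$ satisfies $\tau_k \ge \max_{c \notin S} U_c(h)$, since no omitted token can exceed it. For the softmax, let $Z_S = \sum_{i \in S} e^{\langle w_i, h \rangle}$ and $B = \sum_{c \notin S} n_c \, e^{U_c(h)}$. The probability mass of the omitted tokens is then at most

$$ \frac{Z_{\text{omit}}}{Z_S + Z_{\text{omit}}} \le \frac{B}{Z_S + B}, $$

so the approximation can be accepted when $B / (Z_S + B) \le ε$.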

Abstract

Large language models face significant computational bottlenecks during inference due to the expensive output layer computation over large vocabularies. We present CSV-Decode, a novel approach that uses geometric upper bounds to construct small sub-vocabularies for each decoding step, enabling efficient sparse computation while maintaining dual correctness guarantees: exact top-$k$ certification and $ε$-certified softmax approximations. Our method clusters vocabulary embeddings offline and uses centroid-plus-radius bounds to identify which tokens can be safely omitted from computation. We provide a complete system implementation with sparse GEMV kernels, multi-GPU sharding, and CUDA Graph optimization. Experimental results demonstrate significant speedup over full vocabulary decoding while maintaining distributional guarantees and low fallback rates. Our code implementation…
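
The clustering-and-bounds idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `build_clusters` and `certified_top_k`, the use of plain k-means, and the visit order (clusters sorted by decreasing bound) are all assumptions made for this example. It shows only the exact top-$k$ certificate, on the CPU, with none of the sparse GEMV, sharding, or CUDA Graph machinery.

```python
import numpy as np

def build_clusters(W, n_clusters, n_iters=10, seed=0):
    """Offline step (assumed: plain k-means) over output embeddings W of shape (V, d)."""
    rng = np.random.default_rng(seed)
    centroids = W[rng.choice(len(W), n_clusters, replace=False)].copy()
    for _ in range(n_iters):
        d2 = ((W[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        assign = d2.argmin(1)
        for c in range(n_clusters):
            members = W[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    # Final assignment and radii are computed against the final centroids,
    # so every token lies within radii[c] of its own centroid.
    d2 = ((W[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(1)
    radii = np.zeros(n_clusters)
    for c in range(n_clusters):
        members = W[assign == c]
        if len(members):
            radii[c] = np.linalg.norm(members - centroids[c], axis=1).max()
    return centroids, radii, assign

def certified_top_k(h, W, centroids, radii, assign, k):
    """Score clusters in order of decreasing upper bound; stop once top-k is certified."""
    # Cauchy-Schwarz: <w_i, h> <= <mu_c, h> + r_c * ||h|| for every token i in cluster c.
    bounds = centroids @ h + np.linalg.norm(h) * radii
    cand_ids = np.empty(0, dtype=np.int64)
    cand_logits = np.empty(0)
    for c in np.argsort(-bounds):
        if len(cand_ids) >= k:
            kth = np.partition(cand_logits, -k)[-k]  # k-th largest logit so far
            if bounds[c] <= kth:
                break  # no remaining cluster can contain a larger logit
        ids = np.flatnonzero(assign == c)
        cand_ids = np.concatenate([cand_ids, ids])
        cand_logits = np.concatenate([cand_logits, W[ids] @ h])
    top = np.argsort(-cand_logits)[:k]
    return cand_ids[top], cand_logits[top], len(cand_ids)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(2000, 16))   # toy "vocabulary" embeddings
    h = rng.normal(size=16)           # toy hidden state
    centroids, radii, assign = build_clusters(W, n_clusters=32)
    ids, logits, n_scored = certified_top_k(h, W, centroids, radii, assign, k=5)
    print("tokens scored:", n_scored, "of", len(W))
```

On random Gaussian embeddings the clusters are loose, so the bound prunes little; any savings depend on how tightly real vocabulary embeddings cluster. When the certificate is not reached before all clusters are visited, the loop simply degenerates to full-vocabulary scoring, which plays the role of a fallback in this sketch.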

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications