VQ-Logits: Compressing the Output Bottleneck of Large Language Models via Vector Quantized Logits

Jintian Shao; Hongyi Huang; Jiayi Wu; YiMing Cheng; ZhiYu Wu; You Shan; MingKai Zheng

arXiv:2505.10202·cs.CL·May 16, 2025



Open Access

TL;DR

VQ-Logits introduces a vector quantization-based method to significantly compress the output layer of large language models, reducing parameters and computation with minimal impact on performance.

Contribution

The paper proposes a VQ-based output layer that replaces the large vocabulary embedding matrix with a small shared codebook, substantially reducing output-layer parameters and speeding up inference in LLMs.

Findings

01 Achieves up to a 99% reduction in output-layer parameters

02 Provides a 6x speedup in logit computation

03 Incurs only a 4% increase in perplexity

Abstract

Large Language Models (LLMs) have achieved remarkable success but face significant computational and memory challenges, particularly due to their extensive output vocabularies. The final linear projection layer, mapping hidden states to vocabulary-sized logits, often constitutes a substantial portion of the model's parameters and computational cost during inference. Existing methods like adaptive softmax or hierarchical softmax introduce structural complexities. In this paper, we propose VQ-Logits, a novel approach that leverages Vector Quantization (VQ) to drastically reduce the parameter count and computational load of the LLM output layer. VQ-Logits replaces the large V × d_model output embedding matrix with a small, shared codebook of K embedding vectors (K << V). Each token in the vocabulary is mapped to one of these K codebook vectors. The LLM predicts logits over this compact…
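The idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract is truncated before it says how codebook logits reach the full vocabulary, so the step that copies each code's logit to its assigned tokens is an assumption. The names vq_logits, output_layer_params, and token_to_code, the modulo assignment, and the toy sizes are all hypothetical.

```python
import random

def vq_logits(hidden, codebook, token_to_code):
    """Compute K codebook logits, then copy each one to the tokens assigned to it."""
    # K dot products instead of V: one logit per codebook vector.
    code_logits = [sum(h * c for h, c in zip(hidden, vec)) for vec in codebook]
    # Assumed expansion step: every token inherits the logit of its code.
    return [code_logits[k] for k in token_to_code]

def output_layer_params(vocab_size, d_model, num_codes):
    """Return (dense, vq) weight counts, ignoring biases and the integer token-to-code map."""
    return vocab_size * d_model, num_codes * d_model

random.seed(0)
V, d_model, K = 12, 4, 3  # toy sizes chosen only for illustration
codebook = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(K)]
token_to_code = [t % K for t in range(V)]  # hypothetical fixed assignment
hidden = [random.gauss(0, 1) for _ in range(d_model)]

logits = vq_logits(hidden, codebook, token_to_code)
dense, vq = output_layer_params(V, d_model, K)
print(len(logits), dense, vq)  # prints: 12 48 12
```

In this sketch the weight count shrinks by a factor of V / K, and tokens that share a code receive identical logits. Whether the method adds anything to tell such tokens apart cannot be determined from the truncated abstract.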


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Adaptive Softmax · Hierarchical Softmax · Softmax