LLM Vocabulary Compression for Low-Compute Environments
Sreeram Vennam, Anish Joishy, Ponnurangam Kumaraguru

TL;DR
This paper introduces a vocabulary compression method for language models that reduces memory usage by up to 3.4x and increases throughput by up to 3x, enabling deployment in low-compute settings without significant performance loss.
Contribution
The authors propose a token grouping technique based on BPE merges that compresses the final linear layer of language models, so the full logits tensor is never materialized.
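To make the grouping idea concrete, here is a minimal sketch of a two-stage prediction: first a group, then a token within that group. It rests on assumptions that are not confirmed by this summary: that groups are contiguous ranges of token ids (BPE tokenizers typically assign ids in merge order), and that the token probability factorizes as p(group) times p(offset | group). The names `split_token_id`, `token_log_prob`, and `group_size` are hypothetical and are not taken from the paper.

```python
import math

def split_token_id(token_id, group_size):
    """Map a flat token id to (group index, offset within the group).

    Assumption: groups are contiguous id ranges of equal size.
    """
    return divmod(token_id, group_size)

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def token_log_prob(token_id, group_scores, offset_scores, group_size):
    """log p(token) = log p(group) + log p(offset | group).

    group_scores has one entry per group; offset_scores has one entry per
    position inside the target group only, so no vocabulary-sized vector
    is ever built.
    """
    group, offset = split_token_id(token_id, group_size)
    return log_softmax(group_scores)[group] + log_softmax(offset_scores)[offset]
```

With 3 groups of 4 tokens and uniform scores, every token gets probability 1/12, matching a flat softmax over 12 tokens. The point of the factorization is that each step scores only the number of groups plus one group's worth of tokens.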
Findings
Memory usage reduced by up to 3.4x
Throughput increased by up to 3x
Performance on TinyStories comparable to GPT-Neo and GPT2
Abstract
We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo and GPT2 while significantly improving throughput by up to 3x, making it suitable for low-compute environments.
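A rough back-of-the-envelope calculation shows why the logits tensor is memory-intensive. The sketch below compares a full `batch x seq_len x vocab` float32 tensor against a grouped layout that only scores the groups plus the tokens of one group. The batch size, sequence length, and group size are hypothetical values chosen for illustration, and 50257 is the GPT-2 tokenizer's vocabulary size, assumed here rather than taken from the paper.

```python
def logits_bytes(batch, seq_len, width, bytes_per_value=4):
    """Bytes needed for a (batch, seq_len, width) tensor of float32 values."""
    return batch * seq_len * width * bytes_per_value

VOCAB = 50257        # GPT-2 tokenizer vocabulary size (assumed setting)
GROUP_SIZE = 256     # hypothetical group size, not from the paper
NUM_GROUPS = -(-VOCAB // GROUP_SIZE)  # ceiling division

BATCH, SEQ_LEN = 8, 512  # hypothetical training shapes

# Full output layer: one score per vocabulary entry at every position.
full = logits_bytes(BATCH, SEQ_LEN, VOCAB)

# Grouped output: one score per group plus one score per slot in a group.
grouped = logits_bytes(BATCH, SEQ_LEN, NUM_GROUPS + GROUP_SIZE)

print(f"full logits:    {full / 2**20:.1f} MiB")
print(f"grouped scores: {grouped / 2**20:.1f} MiB")
```

This covers the output scores alone. The reported 3.4x reduction is an overall figure from the paper, which also includes weights, activations, and optimizer state, so the two numbers are not directly comparable.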
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Linear Layer · GPT-Neo
