LLM Vocabulary Compression for Low-Compute Environments

Sreeram Vennam; Anish Joishy; Ponnurangam Kumaraguru

arXiv:2411.06371 · cs.CL · November 12, 2024

TL;DR

This paper introduces a vocabulary compression method for language models that reduces memory usage by up to 3.4x and increases throughput by up to 3x, enabling deployment in low-compute settings without significant performance loss.

Contribution

The authors propose a token grouping technique based on Byte Pair Encoding (BPE) merges that compresses the final linear layer of a language model, so that the memory-intensive logits tensor is never materialized.
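As a rough illustration of what grouping tokens can look like, the sketch below assumes that token IDs follow BPE merge order and that groups are contiguous ID ranges of a fixed size. Both are assumptions made for illustration; this summary does not specify how groups are formed. The names `to_group`, `from_group`, and `group_size` are hypothetical.

```python
from typing import Tuple


def to_group(token_id: int, group_size: int) -> Tuple[int, int]:
    """Split a token ID into (group index, offset within the group).

    Assumption: groups are contiguous ranges of token IDs.
    """
    return divmod(token_id, group_size)


def from_group(group: int, offset: int, group_size: int) -> int:
    """Inverse of to_group: rebuild the token ID from its two parts."""
    return group * group_size + offset


vocab_size = 50257   # size of the GPT-2 BPE vocabulary
group_size = 256     # hypothetical choice, not taken from the paper
num_groups = -(-vocab_size // group_size)  # ceiling division

group, offset = to_group(50256, group_size)
restored = from_group(group, offset, group_size)
```

With a mapping like this, a model only needs to score `num_groups` groups plus `group_size` offsets rather than every vocabulary entry at once.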

Findings

01. Memory usage reduced by up to 3.4x

02. Throughput increased by up to 3x

03. Maintains performance comparable to GPT-Neo and GPT2

Abstract

We present a method to compress the final linear layer of language models, reducing memory usage by up to 3.4x without significant performance loss. By grouping tokens based on Byte Pair Encoding (BPE) merges, we prevent materialization of the memory-intensive logits tensor. Evaluations on the TinyStories dataset show that our method performs on par with GPT-Neo and GPT2 while significantly improving throughput by up to 3x, making it suitable for low-compute environments.
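To make the memory argument concrete, the following sketch compares the number of logit entries produced by a standard output layer with a hypothetical two-step head that first scores groups and then scores tokens inside one group. The two-step factorization, the shapes, and all function names are assumptions for illustration, not the paper's implementation, and the resulting ratio is not a reproduction of the reported 3.4x figure.

```python
import math
from typing import List


def full_logit_count(batch: int, seq_len: int, vocab_size: int) -> int:
    """Entries in a standard logits tensor of shape (batch, seq_len, vocab_size)."""
    return batch * seq_len * vocab_size


def grouped_logit_count(batch: int, seq_len: int, vocab_size: int, group_size: int) -> int:
    """Entries when each position scores num_groups groups plus group_size offsets."""
    num_groups = math.ceil(vocab_size / group_size)
    return batch * seq_len * (num_groups + group_size)


def log_softmax(scores: List[float]) -> List[float]:
    """Numerically stable log-softmax over a plain Python list."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


def token_log_prob(group_scores: List[float], within_scores: List[float],
                   group: int, offset: int) -> float:
    """Assumed factorization: log p(token) = log p(group) + log p(offset | group)."""
    return log_softmax(group_scores)[group] + log_softmax(within_scores)[offset]


# Hypothetical sizes: batch 8, sequence length 512, GPT-2 vocabulary, groups of 256.
full = full_logit_count(8, 512, 50257)
grouped = grouped_logit_count(8, 512, 50257, 256)
```

Under these assumptions the grouped head stores far fewer scores per position, which is the intuition behind avoiding the full logits tensor. Actual savings depend on the rest of the model and on how the grouping is implemented.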

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Linear Layer · GPT-Neo