Scaling Embedding Layers in Language Models

Da Yu; Edith Cohen; Badih Ghazi; Yangsibo Huang; Pritish Kamath; Ravi Kumar; Daogao Liu; Chiyuan Zhang

arXiv:2502.01637 · cs.CL · October 27, 2025

TL;DR

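The paper introduces SCONE, a method that extends a language model's input embedding layer with contextualized embeddings for a set of frequent n-grams. Because these embeddings are precomputed after training and stored in off-accelerator memory, both their number and the size of the model that learns them can be scaled while accelerator usage at inference (FLOPS and memory) stays fixed.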

Contribution

SCONE provides a novel approach to extend embedding layers with n-gram embeddings, allowing scalable model improvements while maintaining low inference latency.

Findings

01. Scaling n-gram embeddings improves model performance.

02. A model with 1B parameters outperforms a 1.9B baseline.

03. Inference cost remains low despite scaling.

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a new method for extending input embedding layers to enhance language model performance. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. After training, embeddings are precomputed and stored in off-accelerator memory; during inference, querying them has minimal impact on latency due to the low complexity of embedding lookups. SCONE enables two new scaling strategies: increasing the number of n-gram embeddings and scaling the model used to learn them, both while maintaining fixed accelerator usage during inference (in terms of FLOPS and memory). We show that scaling both aspects enables…
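To make the inference-time lookup described in the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the dictionary-based table, the longest-match rule, the `max_n` cap, and the choice to fall back to the ordinary token embedding when no n-gram matches are all hypothetical, since the abstract does not specify how n-grams are matched or how their embeddings are combined with token embeddings.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]
NGram = Tuple[int, ...]


def longest_ngram_ending_at(
    tokens: Sequence[int], i: int, table: Dict[NGram, Vector], max_n: int
) -> Optional[NGram]:
    """Return the longest n-gram (n >= 2) ending at position i found in table.

    Assumption: longest-match over suffixes of the context; the abstract
    does not state the matching rule.
    """
    for n in range(min(max_n, i + 1), 1, -1):
        key = tuple(tokens[i - n + 1 : i + 1])
        if key in table:
            return key
    return None


def input_embeddings(
    tokens: Sequence[int],
    token_emb: Dict[int, Vector],
    ngram_emb: Dict[NGram, Vector],
    max_n: int = 3,
) -> List[Vector]:
    """Build one input vector per token.

    ngram_emb stands in for the precomputed table held in off-accelerator
    memory; each query is a plain key lookup, so no extra model runs here.
    Assumption: a matched n-gram embedding is used in place of the token
    embedding, and unmatched positions keep the token embedding.
    """
    out: List[Vector] = []
    for i, tok in enumerate(tokens):
        key = longest_ngram_ending_at(tokens, i, ngram_emb, max_n)
        vec = ngram_emb[key] if key is not None else token_emb[tok]
        out.append(list(vec))
    return out
```

In this sketch the token vocabulary is untouched, so the output layer and decoding cost do not change; only the input side consults the larger table. Growing `ngram_emb` adds host-side storage and lookups but no accelerator FLOPS, which is the property the abstract highlights.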

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Sparse Evolutionary Training