Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

Chen Liu; Xingzhi Sun; Xi Xiao; Alexandre Van Tassel; Ke Xu; Kristof Reimann; Danqi Liao; Mark Gerstein; Tianyang Wang; Xiao Wang; Smita Krishnaswamy

arXiv:2602.00217·cs.LG·May 6, 2026

TL;DR

This paper introduces a dispersion loss to counteract embedding condensation in small language models, improving their generalization and performance without increasing parameters.

Contribution

The paper identifies embedding condensation as a geometric issue in small models and proposes a dispersion loss to mitigate it, which improves model performance.
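To make the idea of a dispersion loss concrete, here is a minimal sketch of one generic form: a penalty that grows when embeddings point in similar directions. The log-mean-exp formulation, the function names `cosine` and `dispersion_penalty`, and the `temperature` parameter are all assumptions chosen for illustration; the page does not state the paper's actual loss, so this should not be read as the authors' method.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dispersion_penalty(embeddings, temperature=0.5):
    """Hypothetical dispersion penalty (not the paper's formulation).

    Computes log(mean(exp(cos(u, v) / temperature))) over all distinct
    pairs. Pairs that point in similar directions dominate the mean, so
    the value is higher for condensed embeddings than for spread-out ones.
    """
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    return math.log(sum(math.exp(s / temperature) for s in sims) / len(sims))


# Toy 2-D embeddings: one set squeezed into a narrow cone, one spread out.
condensed = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
dispersed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

penalty_condensed = dispersion_penalty(condensed)
penalty_dispersed = dispersion_penalty(dispersed)
```

In training, a term like this would typically be added to the task loss with a small weight (for example `total = task_loss + weight * penalty`, where `weight` is a hypothetical hyperparameter), so that minimizing the total pushes embeddings apart without adding parameters.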

Findings

01 Dispersion loss reduces embedding condensation in small models.

02 Mitigating condensation improves performance across multiple benchmarks.

03 Larger models are less susceptible to embedding condensation.

Abstract

Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in smaller models. We observe a geometric phenomenon which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models such as GPT2 and Qwen3-0.6B exhibit severe condensation, whereas larger models such as GPT2-xl and Qwen3-32B are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably…
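One simple way to see what "collapse into a narrow cone-like subspace" means is to look at the average cosine similarity between pairs of embeddings: values near 1 indicate that the vectors point in nearly the same direction. The sketch below assumes this metric purely for illustration; the excerpt above does not say how the authors quantify condensation, and the function name `mean_pairwise_cosine` and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all distinct pairs of non-zero vectors.

    Assumed here as a rough proxy for condensation: values close to 1 mean
    the embeddings occupy a narrow cone.
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-D examples, not real model embeddings.
narrow_cone = [[1.0, 0.1], [1.0, 0.0], [1.0, -0.1]]
spread_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

score_narrow = mean_pairwise_cosine(narrow_cone)
score_spread = mean_pairwise_cosine(spread_out)
```

With real models, the same function could be applied to a sample of token embeddings from a given layer, which would allow a like-for-like comparison between a small and a large model from the same family.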

Code & Models

Repositories

chenliu-1996/LM-Dispersion (GitHub)