TL;DR
This paper introduces a dispersion loss to counteract embedding condensation in small language models, improving their generalization and performance without increasing parameters.
Contribution
It identifies embedding condensation as a geometric issue in small models and proposes a dispersion loss to mitigate it, enhancing model performance.
Findings
Dispersion loss reduces embedding condensation in small models.
Mitigating condensation improves performance across multiple benchmarks.
Larger models are less susceptible to embedding condensation.
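The summary above does not state the form of the dispersion loss, so the following is only an illustrative sketch of one generic way such a penalty could be written: a log-mean-exp over pairwise cosine similarities, which grows as representations cluster together. The function name `dispersion_loss`, the `temperature` parameter, and the toy data are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def dispersion_loss(hidden: np.ndarray, temperature: float = 0.5) -> float:
    """Hypothetical dispersion penalty (not the paper's exact formulation).

    hidden: array of shape (num_tokens, dim), one representation per row.
    Returns log(mean(exp(cos_sim / temperature))) over all distinct pairs,
    which is larger when the rows point in similar directions.
    """
    norms = np.linalg.norm(hidden, axis=1, keepdims=True)
    unit = hidden / np.clip(norms, 1e-12, None)   # L2-normalize each row
    sims = unit @ unit.T                          # pairwise cosine similarities
    mask = ~np.eye(len(unit), dtype=bool)         # drop self-similarity terms
    return float(np.log(np.mean(np.exp(sims[mask] / temperature))))

# Toy data (assumed): random directions vs. the same vectors with a shared offset.
rng = np.random.default_rng(0)
spread = rng.standard_normal((128, 32))
clustered = spread + 5.0                          # shared offset pulls rows together

loss_spread = dispersion_loss(spread)
loss_clustered = dispersion_loss(clustered)
```

In a training setup, a term like this would typically be added to the language-modeling loss with a small weight, so that minimizing it pushes representations apart without adding parameters.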
Abstract
Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in smaller models. We observe a geometric phenomenon, which we term embedding condensation, where token embeddings collapse into a narrow cone-like subspace in some language models. Through systematic analyses across multiple Transformer families, we show that small models exhibit severe condensation, whereas larger models are more resistant to this phenomenon. Additional observations show that embedding condensation is not reliably…
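To make the "narrow cone" picture concrete, one simple diagnostic is the mean pairwise cosine similarity of a set of embeddings: values near 0 suggest directions spread over the space, while values near 1 suggest the vectors share a dominant direction. This is a minimal sketch under that assumption; the abstract does not specify which metric the authors use, and the function name and synthetic data below are hypothetical.

```python
import numpy as np

def mean_pairwise_cosine(embeddings: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # L2-normalize each row
    sims = unit @ unit.T                             # pairwise cosine similarities
    n = sims.shape[0]
    off_diag_sum = sims.sum() - np.trace(sims)       # exclude self-similarity
    return float(off_diag_sum / (n * (n - 1)))

# Synthetic illustration (assumed data, not model embeddings).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((200, 64))           # directions spread out
condensed = isotropic + 5.0                          # shared offset forms a cone

score_isotropic = mean_pairwise_cosine(isotropic)
score_condensed = mean_pairwise_cosine(condensed)
```

Applied to real models, the rows would be token embeddings or hidden states extracted from a given layer; comparing the score across model sizes is one way to probe the trend the abstract describes.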
