SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

Yan Sun, Guoxia Wang, Jinle Zeng, JiaBin Yang, Shuai Li, Li Shen, Dacheng Tao, DianHai Yu, Haifeng Wang

arXiv:2605.08809·cs.CL·May 12, 2026

TL;DR

SimReg introduces an embedding similarity regularization technique that enhances large language model pretraining by improving token representation quality, leading to faster convergence and better zero-shot performance.

Contribution

This work applies similarity-based regularization, previously studied mainly in supervised fine-tuning and classification, to large-scale LLM pretraining, and reports faster training convergence and better zero-shot performance.

Findings

01

Training convergence accelerated by over 30%.

02

Zero-shot performance improved by over 1%.

03

Effective across dense and MoE architectures.

Abstract

Pretraining large language models (LLMs) with next-token prediction has led to remarkable advances, yet the context-dependent nature of token embeddings in such models results in high intra-class variance and inter-class similarity, which hinders the efficiency of representation learning. While similarity-based regularization has shown benefits in supervised fine-tuning and classification tasks, its application and efficacy in large-scale LLM pretraining remain underexplored. In this work, we propose SimReg, an embedding similarity regularization loss that explicitly encourages token representations with the same ground-truth label within each sequence to be more similar, while enforcing separation from different-label tokens via a contrastive loss. Our analysis reveals that this mechanism introduces gains by enlarging multi-classification margins, thereby enabling more…
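To make the mechanism in the abstract concrete, here is a minimal sketch of one way such a regularizer could look. The abstract does not give the exact formulation, so this is an assumption throughout: it uses a supervised-contrastive-style loss over the positions of a single sequence, with cosine similarity and a `temperature` parameter. The function name `simreg_sketch` and its arguments are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def simreg_sketch(hidden, labels, temperature=0.1):
    """Contrastive regularizer over one sequence (illustrative assumption).

    hidden: list of per-position embedding vectors.
    labels: ground-truth next-token id for each position.
    Positions sharing a label are treated as positives; all other
    positions in the same sequence act as negatives.
    """
    n = len(hidden)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # a label that occurs once has nothing to pull toward
        logits = {
            j: cosine(hidden[i], hidden[j]) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all other positions.
        peak = max(logits.values())
        log_denom = peak + math.log(
            sum(math.exp(x - peak) for x in logits.values())
        )
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


labels = [5, 9, 5, 9]
separated = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
mixed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
loss_separated = simreg_sketch(separated, labels)
loss_mixed = simreg_sketch(mixed, labels)
```

In this toy case the loss is lower when same-label positions share an embedding direction (`separated`) than when they do not (`mixed`). In pretraining, a term like this would presumably be added to the next-token cross-entropy with a weighting coefficient; that weight and the layer the embeddings come from are not stated in the abstract.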
