Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Huanran Chen, Huaqing Zhang, Xiao Li, Yinpeng Dong, Ke Shen, Jun Zhu

TL;DR
This paper introduces the Nexus optimizer, which encourages convergence to common minima across data sources, significantly improving downstream performance of large language models without changing pretraining loss.
Contribution
The paper proposes the Nexus optimizer that promotes common minima in pretraining, leading to better downstream generalization compared to standard optimizers.
Findings
Nexus improves downstream performance across various models and data mixtures.
Nexus reduces out-of-distribution loss and enhances complex reasoning accuracy.
Standard optimizers often converge to distant task-specific minima, hindering generalization.
Abstract
Pretraining is the cornerstone of Large Language Models (LLMs), consuming the vast majority of the computational budget and data and serving as the primary engine for their capabilities. During pretraining, LLMs acquire foundational knowledge from unprecedentedly massive and diverse data sources, encompassing a vast array of domains such as general language, mathematics, code, and complex reasoning. In this work, we investigate an interesting geometric question regarding the converged state of pretraining: Does the model converge to a common minimizer across all data sources (e.g., \cref{fig:cwa_illustration:close}), or merely a minimizer of the summed loss (e.g., \cref{fig:cwa_illustration:distant})? We hypothesize that the geometric "closeness" of task-specific minima is intrinsically linked to downstream generalization. We reveal that standard optimizers (e.g., AdamW) often converge to…
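The distinction the abstract draws between a common minimizer and a minimizer of the summed loss can be illustrated with a toy example. The sketch below is not the Nexus optimizer and not the paper's metric. It assumes hypothetical quadratic per-source losses L_i(theta) = 0.5 * ||theta - c_i||^2, and the names `task_grads`, `summed_minimizer`, and `per_task_grad_norms` are invented for illustration. At the minimizer of the summed loss the per-source gradients always cancel, but each one is zero only when the per-source minima coincide.

```python
import numpy as np

def task_grads(theta, centers):
    # Gradient of each toy loss L_i(theta) = 0.5 * ||theta - c_i||^2,
    # one row per data source (assumed quadratic, for illustration only).
    return theta - centers

def summed_minimizer(centers):
    # For these quadratics, the summed loss is minimized at the mean of the centers.
    return centers.mean(axis=0)

def per_task_grad_norms(centers):
    # Norm of each source's gradient, evaluated at the summed-loss minimizer.
    theta = summed_minimizer(centers)
    return np.linalg.norm(task_grads(theta, centers), axis=1)

# Two sources whose minima coincide: the summed minimizer is a common minimizer.
close = np.array([[1.0, 1.0], [1.0, 1.0]])
# Two sources whose minima are far apart: the summed minimizer sits between them.
distant = np.array([[0.0, 0.0], [2.0, 2.0]])

for name, centers in [("close", close), ("distant", distant)]:
    theta = summed_minimizer(centers)
    total_grad = task_grads(theta, centers).sum(axis=0)
    print(name, "summed gradient:", total_grad,
          "per-source gradient norms:", per_task_grad_norms(centers))
```

In both cases the summed gradient vanishes, so the summed loss alone cannot tell the two situations apart. Only the per-source gradient norms reveal whether the solution is a minimum for every source individually.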
