Hidden Breakthroughs in Language Model Training
Sara Kangaslahti, Elan Rosenfeld, and Naomi Saphra

TL;DR
This paper introduces POLCA, a method to uncover hidden conceptual breakthroughs during language model training by decomposing loss changes, revealing interpretable phase transitions that enhance understanding of learning dynamics.
Contribution
The paper presents POLCA, a novel technique for decomposing loss variations to identify hidden, interpretable breakthroughs in model training that are obscured by traditional loss metrics.
Findings
POLCA successfully recovers clusters that represent interpretable breakthroughs.
Hidden phase transitions are frequent and can be identified with the proposed method.
The method enhances unsupervised interpretability of training dynamics.
Abstract
Loss curves are smooth during most of model training, so visible discontinuities stand out as possible conceptual breakthroughs. Studying these breakthroughs enables a deeper understanding of learning dynamics, but only when they are properly identified. This paper argues that similar breakthroughs occur frequently throughout training but they are obscured by a loss metric that collapses all variation into a single scalar. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along arbitrary bases of the low-rank training subspace. We use our method to identify clusters of samples that share similar changes in loss during training, disaggregating the overall loss into that of smaller groups of conceptually similar data. We validate our method on synthetic arithmetic and natural language tasks, showing that POLCA recovers clusters that represent…
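To make the idea of decomposing a loss change along basis directions concrete, here is a minimal sketch. It is not the paper's POLCA estimator: it assumes a plain first-order approximation, in which the change in sample i's loss over one update is roughly g_i · Δθ, and splits that dot product across the rows of an orthonormal basis. The function name `projected_loss_change`, the random gradients, and the random basis are all hypothetical, chosen only for illustration.

```python
import numpy as np


def projected_loss_change(grads, delta_theta, basis):
    """Split first-order per-sample loss changes along basis directions.

    grads:       (n_samples, n_params) per-sample gradients before the update
    delta_theta: (n_params,) parameter update
    basis:       (k, n_params) matrix with orthonormal rows
    Returns an (n_samples, k) array whose entry [i, j] is
    (g_i . b_j) * (delta_theta . b_j).
    """
    grad_coords = grads @ basis.T      # gradient coordinates in the basis
    step_coords = basis @ delta_theta  # update coordinates in the basis
    return grad_coords * step_coords   # broadcasts over samples


# Toy data (assumption): random per-sample gradients and a random basis.
rng = np.random.default_rng(0)
n_samples, n_params = 6, 5
grads = rng.normal(size=(n_samples, n_params))
delta = -0.01 * grads.mean(axis=0)     # one plain gradient-descent step

# A full orthonormal basis taken from a QR factorisation.
q, _ = np.linalg.qr(rng.normal(size=(n_params, n_params)))
basis = q.T

contrib = projected_loss_change(grads, delta, basis)

# With a complete orthonormal basis, the per-direction terms add back up
# to the ordinary first-order loss change of each sample.
print(np.allclose(contrib.sum(axis=1), grads @ delta))
```

Each row of `contrib` is one sample's loss change broken down by direction. Individual entries can be positive or negative even when the aggregate loss goes down, which is why a per-direction, per-sample view can show structure that a single scalar loss hides. Rows with similar profiles could then be grouped with any standard clustering routine.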
Peer Reviews
Decision: ICLR 2026 Poster
* Well-motivated problem, and a clean derivation of a simple methodology for addressing it
* The paper is very clearly written, with appropriate and well-captioned figures
* Some quite compelling results in a toy setting with addition and carrying
* The automatic labelling for clusters in the language model setting seems to be well done; I found this interesting in its own right.
* Major: I am not completely convinced by the framing of “breakthroughs in the loss” being discovered in the smooth learning curve via POLCA for the larger language models. If I can summarise (8) we decompose the loss into a sum of components, which may be positive or negative. By their nature (since they depend on narrower subsets of the data distribution and directions in parameter space) these components will tend to have much more variety in their behaviour over training, and by construction
- The paper is well presented, with clear writing, high quality and intuitive figures, and a strong argumentative flow.
- The selected problem of discovering training breakthroughs is interesting and pertinent to many areas, such as interpretability and the study of model training dynamics.
- The new method, POLCA, is intuitive, well justified theoretically, and provides a satisfying means for attributing changes in loss to different directions in the parameter space. While it bears a strong sim
- [W1] The results presented are not entirely convincing in terms of showing that POLCA discovers training breakthroughs based on human interpretable features. In figure 4c and section 4.2, for instance, the authors claim that the first basis vector “recovers the digit skill”. However, clusters 1 and 3 are composites of multiple different digit positions, so it is difficult to say that all of those examples are having their loss improve because of the same skill, especially since one digit is ex
The authors compare their approach to the previously introduced Loss Change Allocation (LCA) and convincingly argue that the introduced POLCA differs significantly from LCA. The mathematical justification and description are sound, and the authors take great care to describe computationally feasible implementation solutions. As for any unsupervised learning analysis, it is very difficult to define objective performance indicators as well as design demonstrative datasets where the algorithm finds
However, I see a few weaknesses:
- The authors frequently use terms like "(conceptual) breakthrough", "phase transition", and "skill" without providing a proper scientific definition of them. As much as these terms are used in the related (and cited) literature, it would be better to attempt a self-contained definition to avoid talking to a very specific audience only.
- It looks to me that nothing in the introduced methodology is specific to language models, but rather it could be applied to any (
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications · Topic Modeling
