On the Dynamics & Transferability of Latent Generalization during Memorization
Simran Ketha, Venkatakrishnan Ramaswamy

TL;DR
This paper investigates how latent generalization in deep networks evolves during training, especially under memorization, and explores methods to extract and transfer this generalization using linear and non-linear probes.
Contribution
It introduces a new linear probe for extracting latent generalization and demonstrates how to transfer this generalization to improve model performance.
Findings
Latent generalization peaks early in training.
The MASC probe is a quadratic (non-linear) classifier.
Linear probes can partially extract latent generalization.
Abstract
Deep networks have been known to have extraordinary generalization abilities, via mechanisms that aren't yet well understood. It is also known that upon shuffling labels in the training data to varying degrees, deep networks, trained with standard methods, can still achieve perfect or high accuracy on this corrupted training data. This phenomenon is called memorization, and typically comes at the cost of poorer generalization to true labels. Our recent work has demonstrated that the internal representations of such models retain significantly better latent generalization abilities than is directly apparent from the model. In particular, it has been shown that such latent generalization can be recovered via simple probes (called MASC probes) on the layer-wise representations of the model. However, the origin and dynamics over training of this latent generalization during memorization is…
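The label-shuffling setup mentioned in the abstract can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `shuffle_labels`, the `degree` parameter, and the choice to permute labels within a randomly selected subset are all assumptions made here, since the abstract does not specify the exact corruption procedure.

```python
import numpy as np

def shuffle_labels(labels, degree, seed=0):
    """Return (noisy_labels, corrupted_idx).

    Hypothetical helper: a fraction `degree` of the examples is selected
    at random and their labels are permuted among themselves. All other
    labels are left untouched.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_corrupt = int(round(degree * len(labels)))
    # Pick which training examples get a (possibly) wrong label.
    idx = rng.choice(len(labels), size=n_corrupt, replace=False)
    # Shuffle the labels within the selected subset only.
    noisy[idx] = rng.permutation(labels[idx])
    return noisy, idx

# Toy usage: 100 examples, 10 classes, 40% of labels shuffled.
true_labels = np.arange(100) % 10
noisy_labels, corrupted_idx = shuffle_labels(true_labels, degree=0.4, seed=1)
```

A network trained on `noisy_labels` would then be evaluated against `true_labels`, which is the gap between memorization and generalization that the abstract refers to. Permuting within a subset keeps the class histogram unchanged; replacing the selected labels with uniformly random classes is another common choice.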
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
* This paper tries to address a timely problem in machine learning
* Latent generalization is not well defined, and no evidence for this phenomenon is provided (nor a well-cited reference). * The MASC-is-quadratic proof feels orthogonal to the central empirical story; the paper doesn’t explain why that nonlinearity matters for practice or theory. * Evaluation is limited to small datasets/older architectures; there’s no evidence that claims hold on modern large-scale setups (e.g., ResNets/ViTs on ImageNet-1k, language models). * Writing/organization are rough, making it
1. The paper is well-written and easy to follow. The phenomenon studied is interesting and a better understanding of it would reveal a lot about the training dynamics of neural networks and their surprising robustness to noise. 2. The result that the network is more robust to noise when initialized with the Velpic vectors is interesting. It does depend on the training data used for Velpic, as I'm asking later in this review, but I find it surprising that such an intervention before training ca
1. The contributions of this work are not very clear to me. The work largely builds upon the previous work [1] that introduced the MASC classifier, and the initial set of experiments seems to largely obtain the same insights. What seems possibly novel is the dynamics aspect, tracking the performance throughout training, but the behavior is not very surprising to me. The new classifier, Velpic, seems like a more complicated way of training a linear classifier on top of the hidden states instead of
The paper dives into an interesting problem: the dynamics of learning under potential memorization and what we can still gain from it. It touches quite a few sub-directions in this topic and contributes at various angles.
My main concern for this work is its lack of focus. There are at least four items of investigation: 1) the dynamics of generalization power in the latent representation vs. training epochs, 2) the underlying mathematical nature of an existing probe, 3) a new probe, and 4) how the new probe can be used to transfer the generalization power to other models. Given that all these problem settings are relatively new, the authors may be putting too many good things in one paper: none of the problem settings is in
1) The experiments in the paper are extensive and support the claims made in the paper. 2) The new initialization scheme proposed demonstrates practical utility of the empirical understanding of latent generalization skills.
My main concern is that the method proposed draws heavily from the prior work [1]. In particular, the variant proposed is just a simple modification of MASC proposed in [1]. While MASC corresponds to using a subspace with the class-specific top-$m$ principal components (where $m$ varies and, according to [1], is set as a hyperparameter), Velpic is equivalent to using $m=1$. This is because in the first case the quantity being maximized is $(x \cdot p_1)^2$, while Velpic maximizes $x \cdot p_1$, which a
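The contrast this review draws, a squared projection onto a class-specific first principal direction versus the plain dot product with that direction, can be illustrated with a small sketch. It follows only the reviewer's description and is not the MASC or Velpic implementation from the paper. The helper names, the toy 2-D data, the use of uncentered class features, and the step that orients each direction toward its class mean are all assumptions made here so that the linear score has a well-defined sign.

```python
import numpy as np

def top_direction(class_features):
    """First principal direction of one class's (uncentered) features.

    Assumption: the sign returned by the SVD is arbitrary, so the vector
    is flipped to point toward the class mean.
    """
    _, _, vt = np.linalg.svd(class_features, full_matrices=False)
    p = vt[0]
    if p @ class_features.mean(axis=0) < 0:
        p = -p
    return p

def quadratic_scores(x, directions):
    # Squared projection onto each class direction: (x . p_c)^2
    return (directions @ x) ** 2

def linear_scores(x, directions):
    # Plain dot product with each class direction: x . p_c
    return directions @ x

rng = np.random.default_rng(0)
# Two toy "classes" clustered around different axes (illustrative data).
class_a = 0.1 * rng.normal(size=(50, 2)) + np.array([1.0, 0.0])
class_b = 0.1 * rng.normal(size=(50, 2)) + np.array([0.0, 1.0])
directions = np.stack([top_direction(class_a), top_direction(class_b)])

x = np.array([2.0, 0.5])
pred_quadratic = int(np.argmax(quadratic_scores(x, directions)))
pred_linear = int(np.argmax(linear_scores(x, directions)))
pred_quadratic_flipped = int(np.argmax(quadratic_scores(-x, directions)))
pred_linear_flipped = int(np.argmax(linear_scores(-x, directions)))
```

In this sketch the squared score is unchanged when `x` is replaced by `-x`, while the linear score changes sign, so the two rules can disagree on `-x`. That is one concrete way in which a quadratic probe with $m=1$ and a linear probe differ even when they share the same direction vectors.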
Taxonomy
Topics: Child and Animal Learning Development · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications
