AdaDim: Dimensionality Adaptation for SSL Representational Dynamics
Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

TL;DR
AdaDim is a novel training strategy for SSL that adaptively balances feature diversity and mutual information to improve representation quality without complex techniques.
Contribution
The paper introduces AdaDim, a dynamic method that leverages SSL training dynamics to balance feature decorrelation and mutual information, enhancing SSL performance.
Findings
Up to 3% performance improvement over baseline SSL methods.
AdaDim does not require queues, clustering, or predictor networks.
Effective balancing of feature diversity and mutual information improves representations.
Abstract
A key factor in effective self-supervised learning (SSL) is preventing dimensional collapse, where higher-dimensional representation spaces (R) span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce R with a higher dimensionality (H(R)) through objectives that encourage decorrelation of features or sample uniformity in R. A higher H(R) indicates that R has greater feature diversity, which is useful for generalization to downstream tasks. Alongside dimensionality optimization, SSL algorithms also utilize a projection head that maps R into an embedding space Z. Recent work has characterized the projection head as a filter of noisy or irrelevant features from the SSL objective by reducing the mutual information I(R;Z). Therefore, the current literature's view is that a good SSL representation space should have a high…
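The notion of dimensional collapse in the abstract can be made concrete with an effective-rank measurement. The following is a minimal sketch, assuming the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); whether the paper uses exactly this definition is an assumption, and the function name `effective_rank` and the synthetic data are hypothetical, chosen only for illustration.

```python
import numpy as np

def effective_rank(features: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of an (n_samples, n_dims) feature matrix.

    Assumed definition (not confirmed by the source): exp of the Shannon
    entropy of the singular values after normalizing them to sum to one.
    """
    # Center the features so the singular values reflect the covariance spectrum.
    centered = features - features.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)
    p = p[p > eps]  # drop numerically zero directions
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# Synthetic illustration (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, d = 512, 64
full = rng.standard_normal((n, d))  # features use all 64 dimensions
# Collapsed case: 64-dimensional features that only span a 4-dimensional subspace.
collapsed = rng.standard_normal((n, 4)) @ rng.standard_normal((4, d))

rank_full = effective_rank(full)
rank_collapsed = effective_rank(collapsed)
```

In this sketch `rank_collapsed` cannot exceed 4 (up to numerical error) even though the features nominally have 64 dimensions, which is the situation the abstract describes as dimensional collapse.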
Peer Reviews
Decision · Submitted to ICLR 2026
- Clear dynamics story with theory. The Gaussian and information-flow analyses predict two phases (decorrelation vs. uniformity), and the measurements on real runs (eigenvalue trajectories, uniformity, matrix-MI) match those predictions, including the non-monotonic behavior of I(R;Z).
- Low-overhead method. AdaDim’s α from effective rank(Z) plus β-scaled MI regularization is easy to implement (periodic SVD on a few batches) and does not require queues, clustering, student–teacher, or extra pred
- Estimator and stability details for I(R;Z). The matrix-based Rényi-entropy estimator is used operationally, but its bias/variance, windowing, and normalization details are compactly stated. I could see a sensitivity study (batch size, feature dim, normalization choice) as beneficial.
- Theory idealization. The Gaussian joint model and the assumption that I(Y;Z) approaches a constant simplify the narrative; stronger connections to non-Gaussian encoders/projectors or a small-scale non-linear to
- Incorporating I(R;Z) (which is related to the degree of invariance of the representations) with the anti-collapse is useful. Similar ideas were studied in Figure 4 of [1], which could help enrich the discussion.
- The analysis focusing on training dynamics is appreciated, as previous works mainly focus on the loss itself or the final representations.
- The analysis of I(Y;R) is interesting, although since it is an upper and not a lower bound it should be interpreted cautiously.
- The introduc
1) While performance gains are visible with standardized hyperparameters, in a more ideal setting (where every method is tuned to the best of its ability) gains are more marginal.
2) Misrepresentation of previous work on line 85. [41] (in the paper's numbering) does use the degree of invariance of models to augmentations, which can be related to I(R;Z). [17] (in the paper's numbering) also does not measure H(R) but H(Z), which can be related to both H(R) and I(R;Z).
3) From Figure 6, it seems th
- The paper brings a new view to self-supervised learning by linking representation dimensionality and mutual information. It gives both mathematical and empirical evidence to support this idea.
- AdaDim is lightweight and easy to apply. It only changes the loss weighting dynamically and does not require any architectural modification.
- The authors test on CIFAR, Tiny-ImageNet, medical datasets, and ImageNet-100. The method shows consistent improvement across settings.
- The writing is well
The reviewer's main concern is about the experiments:
- The reported improvement in Tables 4 and 5 is marginal.
- The paper does not study how performance changes with different batch sizes or temperature values. These are well-known factors that strongly affect contrastive learning. Larger batch sizes might reduce the improvement effect.
- The method is tested only on small or medium datasets. Results on larger benchmarks such as full ImageNet would better support the generalization claim.
Minor fo
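One of the reviews above asks for more detail on the matrix-based Rényi-entropy estimator of I(R;Z). As a rough illustration of what such an estimator involves, here is a minimal sketch of the standard matrix-based formulation: a trace-normalized kernel Gram matrix per variable, a Hadamard product for the joint term, and entropies computed from eigenvalues. The Gaussian kernel, the bandwidth `sigma`, the order `alpha = 2`, and all function names are assumptions for illustration; the paper's actual kernel, normalization, and windowing are not given in this summary.

```python
import numpy as np

def gram_matrix(x: np.ndarray, sigma: float) -> np.ndarray:
    """Trace-normalized Gaussian-kernel Gram matrix (kernel choice is an assumption)."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (x @ x.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against small negative round-off
    k = np.exp(-sq_dists / (2.0 * sigma ** 2))
    return k / np.trace(k)

def renyi_entropy(a: np.ndarray, alpha: float = 2.0) -> float:
    """Matrix-based Renyi entropy of order alpha (alpha != 1), in bits."""
    if alpha == 1.0:
        raise ValueError("alpha must differ from 1 in this sketch")
    eigenvalues = np.clip(np.linalg.eigvalsh(a), 0.0, None)
    return float(np.log2(np.sum(eigenvalues ** alpha)) / (1.0 - alpha))

def matrix_mutual_information(r, z, sigma_r, sigma_z, alpha=2.0) -> float:
    """I(R;Z) estimate: S(A) + S(B) - S(A∘B / tr(A∘B))."""
    a = gram_matrix(r, sigma_r)
    b = gram_matrix(z, sigma_z)
    joint = a * b  # Hadamard (element-wise) product
    joint = joint / np.trace(joint)
    return renyi_entropy(a, alpha) + renyi_entropy(b, alpha) - renyi_entropy(joint, alpha)

# Synthetic illustration (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n, d = 200, 8
r = rng.standard_normal((n, d))
z_dep = r + 0.1 * rng.standard_normal((n, d))  # embedding that depends on R
z_ind = rng.standard_normal((n, d))            # embedding independent of R
sigma = float(np.sqrt(d))                      # heuristic bandwidth (assumption)

mi_dependent = matrix_mutual_information(r, z_dep, sigma, sigma)
mi_independent = matrix_mutual_information(r, z_ind, sigma, sigma)
```

The estimate depends on the batch size `n`, the feature dimension `d`, and the bandwidth, which is the kind of sensitivity the reviewer suggests studying.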
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Neural Networks and Applications
