Information Flow in Self-Supervised Learning
Zhiquan Tan, Jingqin Yang, Weiran Huang, Yang Yuan, Yifan Zhang

TL;DR
This paper provides a theoretical analysis of self-supervised learning methods using matrix mutual information and entropy, introduces a new M-MAE method, and demonstrates its superior performance on ImageNet.
Contribution
It offers a novel theoretical framework linking loss functions to matrix mutual information and entropy, and proposes M-MAE, a new method that improves performance on image classification tasks.
Findings
M-MAE outperforms state-of-the-art methods on ImageNet.
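The loss functions of Barlow Twins and spectral contrastive learning are shown to implicitly optimize matrix mutual information and matrix joint entropy.
Empirical evaluations report accuracy gains over prior methods, including the 3.9% improvement cited in the abstract.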
Abstract
In this paper, we conduct a comprehensive analysis of two dual-branch (Siamese architecture) self-supervised learning approaches, namely Barlow Twins and spectral contrastive learning, through the lens of matrix mutual information. We prove that the loss functions of these methods implicitly optimize both matrix mutual information and matrix joint entropy. This insight prompts us to further explore the category of single-branch algorithms, specifically MAE and U-MAE, for which mutual information and joint entropy become the entropy. Building on this intuition, we introduce the Matrix Variational Masked Auto-Encoder (M-MAE), a novel method that leverages the matrix-based estimation of entropy as a regularizer and subsumes U-MAE as a special case. The empirical evaluations underscore the effectiveness of M-MAE compared with the state-of-the-art methods, including a 3.9% improvement in…
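The quantities named in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes a von Neumann-style matrix entropy, -tr((K/n) log(K/n)), on a positive semi-definite Gram matrix K with unit diagonal; a joint entropy taken over the Hadamard (element-wise) product of two such matrices; and mutual information as the sum of the two entropies minus the joint entropy. The function names (gram_matrix, matrix_entropy, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np


def gram_matrix(z):
    """Cosine-similarity Gram matrix of row-wise features (unit diagonal)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T


def matrix_entropy(k):
    """Assumed definition: -tr((K/n) log(K/n)), computed from eigenvalues."""
    n = k.shape[0]
    eig = np.linalg.eigvalsh(k / n)  # K/n has trace 1 when diag(K) = 1
    eig = eig[eig > 1e-12]           # treat 0 * log(0) as 0
    return float(-np.sum(eig * np.log(eig)))


def matrix_joint_entropy(k1, k2):
    """Assumed definition: entropy of the Hadamard product K1 * K2."""
    return matrix_entropy(k1 * k2)


def matrix_mutual_information(k1, k2):
    """Assumed definition: H(K1) + H(K2) - H(K1, K2)."""
    return matrix_entropy(k1) + matrix_entropy(k2) - matrix_joint_entropy(k1, k2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k1 = gram_matrix(rng.normal(size=(8, 16)))  # stand-in for branch-1 features
    k2 = gram_matrix(rng.normal(size=(8, 16)))  # stand-in for branch-2 features
    print("entropy:", matrix_entropy(k1))
    print("joint entropy:", matrix_joint_entropy(k1, k2))
    print("mutual information:", matrix_mutual_information(k1, k2))
```

Under these assumed definitions, the identity matrix attains the maximum entropy log(n), while the all-ones matrix (fully collapsed features) has entropy zero. Taking K1 = K2 = K for a 0/1-valued K makes the Hadamard product equal to K, so the mutual information and joint entropy both reduce to the entropy of K, which matches the single-branch intuition stated in the abstract. Each entropy call requires an eigendecomposition, so the cost grows cubically with the matrix size.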
Peer Reviews
Decision: ICML 2024 Poster
+ The paper introduces matrix information theory to understand and connect mainstream methods in self-supervised learning, which is a very meaningful and valuable contribution.
+ The writing is fairly clear; although I did not delve into the mathematical details, I believe they are sound.
+ The initial results on image classification (both the linear-probing and the fine-tuning results) are great.
- While the initial empirical results are great, I do hope to see the final results after a complete run on MAE. The current version of the paper uses U-MAE's implementation, and the hyper-parameters (e.g., batch size) do not follow the settings in MAE. This can cause some discrepancies. MAE's ViT-L, after convergence, can achieve an accuracy of ~85.5 on ImageNet. While the paper's result is promising, it is unclear whether the trend will still hold. So I would be curious to see. If it is too much
**Theoretical Innovation**: The manuscript presents a novel theoretical framework for analyzing self-supervised learning (SSL) methods through the prism of matrix information theory. The use of matrix mutual information and joint entropy is an innovative approach that could provide new insights into the dependencies between features and the propagation of information within neural networks. This theoretical advancement has the potential to deepen our understanding of SSL mechanisms, making it a
**Scalability of Information-Theoretic Measures**: A potential weakness is the lack of a clear discussion of the scalability of the proposed matrix mutual information and joint entropy measures. Calculating these metrics can be computationally intensive, especially for the large-scale datasets and high-dimensional feature spaces typical of self-supervised learning. Any insight into the computational overhead would be much appreciated.
This paper attempts to understand self-supervised learning methods in terms of the mutual information maximization framework, where each random variable is from the online encoder and the target encoder. It could be a somewhat valuable attempt to understand how self-supervised learning methods work.
1. There is no detailed discussion of the relationship between the introduced matrix entropy (Definition 1) and the original Shannon entropy. In fact, the Rényi entropy is defined on a special family of matrices called density matrices. It is not defined for an arbitrary matrix, but only for a Gram matrix. However, this paper misuses the concept of matrix entropy throughout. For example, on page 14, the last paragraph says "Take K1 = Z1 and K2 = Z2, the results follow similarly". However,
Taxonomy
Topics: Model Reduction and Neural Networks · Neural Networks and Applications · Domain Adaptation and Few-Shot Learning
Methods: Barlow Twins · Masked autoencoder
