Slicing Mutual Information Generalization Bounds for Neural Networks
Kimia Nadjahi, Kristjan Greenewald, Rickard Brüel Gabrielsson, Justin Solomon

TL;DR
This paper introduces information-theoretic generalization bounds for neural networks trained on random lower-dimensional slices of the parameter space. The bounds are tighter and cheaper to evaluate than standard mutual information bounds, are validated empirically, and motivate a regularizer for controlling generalization.
Contribution
The paper develops novel bounds for neural network generalization using sliced mutual information, incorporating rate-distortion theory and proposing a regularization method to enhance compressibility and generalization.
Findings
Sliced mutual information bounds are tighter and more scalable than standard mutual information bounds.
Regularization based on compressibility improves generalization.
Empirical validation shows bounds are non-vacuous for neural networks.
Abstract
The ability of machine learning (ML) algorithms to generalize well to unseen data has been studied through the lens of information theory, by bounding the generalization error with the input-output mutual information (MI), i.e., the MI between the training data and the learned hypothesis. Yet, these bounds have limited practicality for modern ML applications (e.g., deep learning), due to the difficulty of evaluating MI in high dimensions. Motivated by recent findings on the compressibility of neural networks, we consider algorithms that operate by slicing the parameter space, i.e., trained on random lower-dimensional subspaces. We introduce new, tighter information-theoretic generalization bounds tailored for such algorithms, demonstrating that slicing improves generalization. Our bounds offer significant computational and statistical advantages over standard MI bounds, as they rely on…
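To make "slicing the parameter space" concrete, here is a minimal sketch of training restricted to a random lower-dimensional subspace. It is an illustration under stated assumptions, not the authors' implementation: it uses linear regression instead of a neural network, assumes an orthonormal random projection `Theta` drawn independently of the data, and the function name `train_in_random_subspace` and all hyperparameters are hypothetical.

```python
import numpy as np

def train_in_random_subspace(X, y, d, steps=500, lr=0.1, seed=0):
    """Fit linear regression with weights constrained to w = Theta @ w_low.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    D = X.shape[1]
    # Assumption: a D x d matrix with orthonormal columns, sampled
    # independently of the training data (reduced QR of a Gaussian matrix).
    Theta, _ = np.linalg.qr(rng.standard_normal((D, d)))
    w_low = np.zeros(d)      # the only trainable parameters (d << D)
    Z = X @ Theta            # inputs as seen through the slice, shape (n, d)
    for _ in range(steps):
        residual = Z @ w_low - y
        grad = Z.T @ residual / len(y)   # gradient of 0.5 * mean squared error
        w_low -= lr * grad
    return Theta, w_low

# Synthetic data: n samples, ambient dimension D, slice dimension d.
rng = np.random.default_rng(1)
n, D, d = 200, 50, 5
X = rng.standard_normal((n, D))
y = X @ rng.standard_normal(D) + 0.1 * rng.standard_normal(n)

Theta, w_low = train_in_random_subspace(X, y, d)
w_full = Theta @ w_low       # weights lifted back to the ambient space
train_mse = np.mean((X @ w_full - y) ** 2)
baseline_mse = np.mean(y ** 2)   # loss of the all-zero initialization
```

Only the `d` coordinates in `w_low` depend on the training data, while `Theta` is fixed before training. Shrinking `d` limits how much the learned weights can encode about the data, but it can also raise the training error, which is the tradeoff the reviews below discuss.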
Peer Reviews
Decision: ICML 2024 Poster
The paper explores tightening information-theoretic generalization bounds while introducing a technique to evaluate these bounds in high-dimensional settings. This is indeed an interesting and important topic in deep learning. The proposed training procedure based on the rate-distortion framework is also novel and interesting.
The sliced information-theoretic bound, while interesting, may be computationally expensive, especially in high dimensions. The bound requires training multiple models for multiple projection matrices. The authors addressed this by using quantization, which avoids estimating MI, but no comparison was made of how loose this bound is relative to the sliced MI bound. Overall, the bounds also become increasingly loose as the dimension increases. Also, more evaluation is needed for Section 4.2 to
This paper tackles the important problem of obtaining an information-theoretic generalization bound that is easy to estimate. The idea of slicing seems very interesting, and it gives rise to a connection to the sliced mutual information.
My main concern is that the main message of the paper is not clear. I think the idea of projecting the parameters is interesting, and it is intuitive that we can have a smaller generalization error. My understanding of the main message of this paper is that slicing is interesting because it gives a better estimator of the information-theoretic generalization bounds. However, I think that obtaining numerical values may not be the only goal of generalization theory. I appreci
### Originality
The work is original, improving upon previous work to take into account the dimension of the parameter manifold and the importance of quantization.
### Quality
Both toy models and deep neural network training are studied. The insights can be useful for practitioners and theoreticians. Experiments are convincing.
### Clarity
The paper is clear overall, but not without flaws (see comments below). The literature review is very accessible. The authors do a good job at making th
### Clarity
There is a lack of clarity or details at times (see **Questions**).
### Bias-variance tradeoff
> On the other hand, decreasing d may increase the training error, implying a tradeoff between generalization error and training error when selecting d.

and

> The choice of d is also important and can be tuned to balance the MI term with the distortion required (how small λ needs to be) to achieve low training error.

Most of the bounds derived by the author can be understood
Taxonomy
Topics: Neural Networks and Applications
