Deep Kernel Posterior Learning under Infinite Variance Prior Weights
Jorge Loría, Anindya Bhadra

TL;DR
This paper introduces a novel Bayesian deep neural network model with infinite variance weights that converges to a process with stable marginals, enabling stochastic kernels and improved representation learning.
Contribution
It demonstrates that deep neural networks with elliptically distributed weights of infinite variance converge to stable processes, extending prior Gaussian process results and enabling stochastic kernel representations.
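As a hedged illustration of how infinite-variance weights can still be conditionally Gaussian, the sketch below covers only the simplest case, a standard Cauchy variable (stable index α = 1). It uses the standard fact that a Gaussian whose variance is drawn from a Lévy distribution (positive stable with index 1/2) is Cauchy. This is not the paper's construction; the function name, seed, and sample size are illustrative assumptions.

```python
import math
import random


def sample_cauchy_via_scale_mixture(rng):
    """Draw one standard Cauchy variate as a Gaussian scale mixture.

    Illustrative helper (hypothetical name, not from the paper).
    If Z ~ N(0, 1), then 1 / Z**2 follows a Levy(0, 1) distribution,
    which is positive stable with index 1/2. Conditionally on that
    variance the draw is Gaussian; marginally it is Cauchy.
    """
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:  # guard against a (practically impossible) exact zero
        z = rng.gauss(0.0, 1.0)
    variance = 1.0 / (z * z)      # random scale: heavy-tailed mixing variable
    g = rng.gauss(0.0, 1.0)       # Gaussian part, independent of the scale
    return math.sqrt(variance) * g


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_cauchy_via_scale_mixture(rng) for _ in range(20000)]
    # For a standard Cauchy, P(|X| < 1) = 1/2, so this fraction should be near 0.5.
    inside = sum(1 for x in draws if abs(x) < 1.0) / len(draws)
    print(f"fraction with |x| < 1: {inside:.3f}")
    print(f"largest |x| in sample: {max(abs(x) for x in draws):.1f}")
```

The empirical fraction is a more reliable check than a sample variance, which does not stabilise for a distribution with infinite variance.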
Findings
Stable process convergence with infinite variance weights
Recursive linking of random covariance kernels in deep networks
Enhanced computational and statistical performance in experiments
Abstract
Neal (1996) proved that infinitely wide shallow Bayesian neural networks (BNNs) converge to Gaussian processes (GPs) when the network weights have bounded prior variance. Cho & Saul (2009) provided a useful recursive formula for deep kernel processes, relating the covariance kernel of each layer to that of the layer immediately below. Moreover, they worked out the explicit form of the layer-wise covariance kernel for several common activation functions. Recent works, including Aitchison et al. (2021), have highlighted that the covariance kernels obtained in this manner are deterministic and hence preclude any possibility of representation learning, which amounts to learning a non-degenerate posterior of a random kernel given the data. To address this, they propose adding artificial noise to the kernel to retain stochasticity, and develop deep kernel inverse Wishart processes.…
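To make the layer-wise recursion attributed to Cho & Saul (2009) concrete, here is a minimal sketch of their degree-1 arc-cosine kernel, the case corresponding to ReLU-type activations, composed over several layers. It shows only the deterministic recursion that the abstract contrasts with random kernels, not the stable-process kernel of this paper. The function name and the choice of a plain dot product as the base kernel are assumptions made for illustration.

```python
import math


def arccos_kernel_relu(x, y, depth):
    """Degree-1 arc-cosine kernel (Cho & Saul, 2009), composed `depth` times.

    Illustrative helper (hypothetical name). Inputs are non-zero vectors
    given as sequences of floats. The recursion is
        k_{l+1}(x, y) = sqrt(k_l(x, x) * k_l(y, y)) / pi * J(theta),
        J(theta) = sin(theta) + (pi - theta) * cos(theta),
        theta = arccos(k_l(x, y) / sqrt(k_l(x, x) * k_l(y, y))).
    """
    k_xx = sum(a * a for a in x)          # base kernel: dot product
    k_yy = sum(b * b for b in y)
    k_xy = sum(a * b for a, b in zip(x, y))
    for _ in range(depth):
        norm = math.sqrt(k_xx * k_yy)
        # Clamp to [-1, 1] to protect acos from rounding error.
        cos_theta = max(-1.0, min(1.0, k_xy / norm))
        theta = math.acos(cos_theta)
        k_xy = norm / math.pi * (math.sin(theta) + (math.pi - theta) * math.cos(theta))
        # On the diagonal theta = 0 and J(0) = pi, so k_xx and k_yy are
        # unchanged by this degree-1 recursion and need no update.
    return k_xy


if __name__ == "__main__":
    x, y = [1.0, 0.0], [0.0, 1.0]
    for depth in range(4):
        print(depth, arccos_kernel_relu(x, y, depth))
```

Every quantity in this recursion is a fixed function of the inputs, which is the sense in which such kernels are deterministic.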
Peer Reviews
Decision: ICLR 2025 Poster
- The technical aspects of the paper appear solid, although I have not checked the results in detail.
- The experiments seem to effectively support the theoretical claims.
- The paper is somewhat challenging to approach, given its niche topic and highly technical content. It also feels quite text-heavy.
**Strengths:**
- The proposed model is capable of representation learning. Specifically, Proposition 3 nicely shows that the feature at layer l depends on the data X, y observed in training, which is unlike many finite width models. It is nice to have this notion of representation learning formalised in this simple way.
- The idea of the paper is relatively straightforward: everything is conditionally Gaussian given this scale parameter, which induces the heavy-tailed behaviour. I see this simpl
**Weaknesses:**
- Unless I am mistaken, "The key finding is that the conditional mutual information decays at a slower rate for smaller α" should be "The key finding is that the mutual information [which itself is computed via MCMC as an expected conditional mutual information] decays at a slower rate for smaller α." Figure 1 shows a mutual information, not conditional mutual information.
**Minor:**
- Theorem 1. $J_\delta(\theta)$ can be computed explicitly, and this is obvious to people famil
The article
* provides a discussion of references, theoretical results, and experiments.
* discusses potential benefits in prediction and uncertainty quantification.
* suggests feature learning that is not possible under a Gaussian process.
1. My main concern with the article is the writing and presentation, which I think need to be improved. The abstract gives a long discussion of prior works but ideally it should instead give a crisp description of the main points in the article. The lengthy discussion in the introduction comments on prior works and perceived limitations, but does not provide a sufficiently concise and clear description of the objective, motivation, and contributions of the present work. Terminology could be int
Taxonomy
Topics: Face and Expression Recognition · Machine Learning and ELM · Neural Networks and Applications
