Gradient-free variational learning with conditional mixture networks
Conor Heins, Hao Wu, Dimitrije Markovic, Alexander Tschantz, Jeff Beck, Christopher Buckley

TL;DR
This paper introduces CAVI-CMN, a fast, gradient-free variational method for training probabilistic conditional mixture networks, offering efficient Bayesian inference with competitive accuracy and scalability.
Contribution
The paper presents a novel gradient-free variational inference approach for conditional mixture networks using conjugacy and Pólya-Gamma augmentation, enabling efficient training without gradients.
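For context, Pólya-Gamma augmentation rests on a standard integral identity due to Polson, Scott and Windle (2013). The form below is a sketch in generic notation: the symbols $\psi$ (a linear predictor), $\omega$ (the auxiliary variable), $a$, $b$ and $\kappa$ are assumptions chosen for illustration and are not taken from the paper.

```latex
\frac{\left(e^{\psi}\right)^{a}}{\left(1 + e^{\psi}\right)^{b}}
  = 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2}, \quad \omega \sim \mathrm{PG}(b, 0).
```

Conditioned on $\omega$, the logistic-type likelihood becomes the exponential of a quadratic in $\psi$. When $\psi$ is linear in the weights, a Gaussian prior on those weights is therefore conditionally conjugate, which is the kind of structure that allows closed-form coordinate updates instead of gradient steps.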
Findings
CAVI-CMN achieves competitive or superior accuracy compared to MLE with backpropagation.
The method maintains competitive runtime and scales well with input size and number of experts.
It provides full posterior distributions over model parameters, enhancing uncertainty quantification.
Abstract
Balancing computational efficiency with robust predictive performance is crucial in supervised learning, especially for critical applications. Standard deep learning models, while accurate and scalable, often lack probabilistic features like calibrated predictions and uncertainty quantification. Bayesian methods address these issues but can be computationally expensive as model and data complexity increase. Previous work shows that fast variational methods can reduce the compute requirements of Bayesian methods by eliminating the need for gradient computation or sampling, but are often limited to simple models. We introduce CAVI-CMN, a fast, gradient-free variational method for training conditional mixture networks (CMNs), a probabilistic variant of the mixture-of-experts (MoE) model. CMNs are composed of linear experts and a softmax gating network. By exploiting conditional conjugacy…
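The "linear experts plus softmax gating" structure described in the abstract can be illustrated with a minimal sketch. This is a generic, deterministic mixture-of-linear-experts forward pass, not the paper's CMN or its CAVI updates; the function names (`softmax`, `affine`, `mixture_predict`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def affine(weights, bias, x):
    """Return weights @ x + bias, with weights given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def mixture_predict(x, gate_w, gate_b, expert_w, expert_b):
    """Gate-weighted average of K linear experts' outputs for one input x."""
    # Softmax gating network: one mixing weight per expert.
    gate_probs = softmax(affine(gate_w, gate_b, x))
    # Each expert is an independent affine map of the same input.
    expert_outs = [affine(W, b, x) for W, b in zip(expert_w, expert_b)]
    out_dim = len(expert_outs[0])
    mean = [sum(p * out[d] for p, out in zip(gate_probs, expert_outs))
            for d in range(out_dim)]
    return mean, gate_probs

# Hypothetical toy parameters: two experts, 1-D input, 1-D output.
gate_w, gate_b = [[1.0], [-1.0]], [0.0, 0.0]
expert_w = [[[2.0]], [[-1.0]]]   # expert 0: y = 2x + 1, expert 1: y = -x + 3
expert_b = [[1.0], [3.0]]

mean, gate_probs = mixture_predict([0.0], gate_w, gate_b, expert_w, expert_b)
print(mean, gate_probs)
```

At `x = 0` both gate logits are zero, so the two experts are weighted equally and the prediction is the average of their outputs. A Bayesian treatment such as the one the abstract describes would place distributions over these weights rather than use point values.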
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
- The CAVI-CMN approach introduces a highly efficient gradient-free Bayesian inference method specifically designed for CMNs. Its reliance on CAVI, paired with conjugate priors and Pólya-Gamma augmentation, demonstrates innovation through a novel combination of existing ideas, especially for probabilistic models that typically depend on gradient-based methods. This efficiency could mark an important step forward in Bayesian neural network inference.
- The paper ensures reproducibility by including
**Insufficient Experimental Validation:** The reliance on smaller UCI and synthetic datasets might not adequately demonstrate the robustness or scalability of the approach in real-world, high-dimensional applications. Without validation on challenging or large-scale datasets such as CIFAR-10 [1] and ImageNet [2], or on real-time data streams, the method’s claims of scalability and efficiency could be viewed as unconvincing, limiting the impact and generalizability of the results. **E
While Pólya-Gamma augmentations have been used for models with other likelihoods or for Gibbs samplers, their usage for multinomial logistic regression models is new as far as I am aware. The paper aims to address an important problem: discriminative classifiers can yield poorly calibrated predictions when trained to maximize the log-likelihood, while Bayesian approaches can be computationally expensive. In their arguably simple examples, the authors demonstrate that their approach does
The empirical evaluation seems a bit limited. In particular, it would be useful if the authors considered additional baselines: (i) variational approaches beyond BBVI for the same model, or at least with some standard control variates [1,2,3], (ii) stochastic-gradient MCMC algorithms [4,5] that can yield better uncertainties [6,7] and are more scalable than NUTS, (iii) different models (e.g. deep ensembles [8], BNNs) to better assess how the reported results (accuracies,
The paper introduces a method that eliminates the need for gradient-based optimization, which is a significant advantage for reducing computational costs.
- Heavy notation: The notation is often overly complex, which hurts readability and understanding. It lacks consistency across sections, making it difficult to follow key points. For instance, the sadly non-numbered equation on page 3 is very unclear:
1. on line 150, the authors define $p_k$, yet this does not appear (at least explicitly) in the equation above
2. there is no explanation or definition of $\pi$. What is $p(\pi)$? And $p(z^n|\pi)$?
3. on line 150, authors say that
**Generative model of piecewise linear representations**: Forming representations from piecewise linear functions is a core component of modern deep learning architectures—with the best example being ReLU activations. This work presents an interesting generative perspective of these representations. While not explored in this paper, I could see this leading to a better way to specify informative priors for feature learning (i.e. draw the linear models from distributions over particular basis
1. **Conflict between method’s presentation and experimental evaluation**: The Introduction and the beginning of Section 3.1 clearly motivate the work from the perspective of improving the uncertainty quantification abilities of deep learning architectures. Thus, when arriving at the experiments section (4), I was expecting the results to compare the performance of the CMN with natural baselines, such as a two-layer NN with ReLU activations, since this would be a non-Bayesian piecewise linear mod
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Gaussian Processes and Bayesian Inference
Methods: Softmax · Variational Inference
