MINDE: Mutual Information Neural Diffusion Estimation
Giulio Franzese, Mustapha Bounoua, Pietro Michiardi

TL;DR
This paper introduces MINDE, a neural diffusion-based method for estimating the mutual information (MI) between random variables and, as a by-product, their entropy. The authors report that it is more accurate than the main alternatives from the literature and that it passes MI self-consistency tests.
Contribution
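The paper presents a new MI estimation approach built on an interpretation of the Girsanov theorem: score-based diffusion models are used to estimate the Kullback-Leibler divergence between two densities as a difference between their score functions, from which MI and entropy estimates follow.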
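For orientation, a score-based KL identity of this kind can be written as below. This is the standard form for two processes that share one forward SDE and differ only in their initial densities $p$ and $q$; the symbols $f$, $g$, $T$, $p_t$, $q_t$ are notation chosen here for illustration and need not match the paper's.

```latex
% Assumed shared forward SDE: dX_t = f(X_t, t)\,dt + g(t)\,dW_t, \quad t \in [0, T]
\mathrm{KL}(p \,\|\, q)
  = \frac{1}{2} \int_0^T g(t)^2 \,
    \mathbb{E}_{x \sim p_t}\!\left[
      \left\lVert \nabla_x \log p_t(x) - \nabla_x \log q_t(x) \right\rVert^2
    \right] dt
  \;+\; \mathrm{KL}(p_T \,\|\, q_T)

% MI expressed through KL divergences, for random variables A and B:
I(A; B)
  = \mathrm{KL}\!\left(p_{A,B} \,\|\, p_A \, p_B\right)
  = \mathbb{E}_{b \sim p_B}\!\left[ \mathrm{KL}\!\left(p_{A \mid B = b} \,\|\, p_A\right) \right]
```

The second line is where the conditional and joint variants mentioned in the abstract plausibly enter: the first expression compares joint densities, the second compares a conditional density with a marginal one.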
Findings
Outperforms existing MI estimation methods on challenging distributions
Passes MI self-consistency tests such as data processing and additivity
Provides a unified framework for estimating MI and entropy using diffusion models
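To make the self-consistency properties concrete, here is a minimal sketch of what such checks verify. It does not use MINDE: the estimator is replaced by the closed-form MI of jointly Gaussian variables, and the helper `gaussian_mi`, the correlation `rho`, and the noise variance are illustrative assumptions. The exact test protocol in the paper may differ.

```python
import numpy as np

def gaussian_mi(cov, dx):
    """Exact MI (nats) between the first dx coordinates and the remaining
    ones of a jointly Gaussian vector with covariance matrix cov."""
    cov = np.asarray(cov, dtype=float)
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_xy = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_xy)

rho = 0.8  # assumed correlation between X and Y
pair = np.array([[1.0, rho], [rho, 1.0]])
mi_single = gaussian_mi(pair, dx=1)

# Additivity: two independent copies (X1, Y1) and (X2, Y2), ordered as
# (X1, X2, Y1, Y2), should carry exactly twice the MI of one pair.
cov4 = np.zeros((4, 4))
cov4[np.ix_([0, 2], [0, 2])] = pair  # (X1, Y1)
cov4[np.ix_([1, 3], [1, 3])] = pair  # (X2, Y2)
mi_double = gaussian_mi(cov4, dx=2)

# Data processing: Z = Y + independent Gaussian noise, so X -> Y -> Z
# and I(X; Z) must not exceed I(X; Y).
noise_var = 0.5
cov_xz = np.array([[1.0, rho], [rho, 1.0 + noise_var]])
mi_processed = gaussian_mi(cov_xz, dx=1)

print(f"I(X;Y)           = {mi_single:.4f}")
print(f"I(X1,X2;Y1,Y2)   = {mi_double:.4f}")
print(f"I(X;Z), Z=Y+eps  = {mi_processed:.4f}")
```

A learned estimator would be substituted for `gaussian_mi` and evaluated on samples; the test then asks whether its outputs respect the same two relations.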
Abstract
In this work we present a new method for the estimation of Mutual Information (MI) between random variables. Our approach is based on an original interpretation of the Girsanov theorem, which allows us to use score-based diffusion models to estimate the Kullback-Leibler divergence between two densities as a difference between their score functions. As a by-product, our method also enables the estimation of the entropy of random variables. Armed with such building blocks, we present a general recipe to measure MI, which unfolds in two directions: one uses a conditional diffusion process, whereas the other uses joint diffusion processes that allow simultaneous modelling of two random variables. Our results, which derive from a thorough experimental protocol over all the variants of our approach, indicate that our method is more accurate than the main alternatives from the literature,…
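The idea of recovering a KL divergence from a difference of scores can be checked numerically in a case where everything is known in closed form. The sketch below is not the paper's estimator: it assumes two 1D Gaussians diffused by the simple forward process dX = dW (so each variance grows by t), uses analytic scores instead of trained networks, and all function and variable names are hypothetical.

```python
import numpy as np

def gaussian_kl(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) in nats; v1 and v2 are variances."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def expected_score_gap(t, m1, v1, m2, v2):
    """E_{x ~ p_t} |d/dx log p_t(x) - d/dx log q_t(x)|^2, in closed form,
    where p_t = N(m1, v1 + t) and q_t = N(m2, v2 + t) under dX = dW."""
    a, b = v1 + t, v2 + t
    return a * (1.0 / b - 1.0 / a) ** 2 + ((m1 - m2) / b) ** 2

# Assumed example densities p = N(0, 1) and q = N(1.5, 2).
m1, v1, m2, v2 = 0.0, 1.0, 1.5, 2.0
T = 10.0  # assumed diffusion horizon

# Integrate (1/2) g(t)^2 E|score gap|^2 over [0, T] with g(t) = 1,
# using a hand-written trapezoidal rule.
t = np.linspace(0.0, T, 200_001)
integrand = 0.5 * expected_score_gap(t, m1, v1, m2, v2)
integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(t))

# Add the residual divergence between the two diffused densities at time T.
kl_from_scores = integral + gaussian_kl(m1, v1 + T, m2, v2 + T)
kl_exact = gaussian_kl(m1, v1, m2, v2)

print(f"KL from score difference: {kl_from_scores:.6f}")
print(f"KL closed form:           {kl_exact:.6f}")
```

In a learned setting, the analytic score gap would be replaced by a Monte Carlo average over diffused samples using two trained score networks, which introduces estimation error that this check deliberately avoids.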
Peer Reviews
Decision: ICLR 2024 poster
I really like that the authors used the Czyz benchmark data, and also the consistency tests. I also appreciate the creativity of the theoretical advancement, though I don't understand it (see below).
I was super excited to read this paper, because I love thinking about mutual information and entropy, and have recently been working on some related issues. The ideas are intriguing, and the results are impressive. So, the rest of this review will focus on the issues for me understanding the methods and results. 1. The biggest issue for me is that I almost immediately got lost. I know information theory pretty well, I learned it from Fred Jelinek before he died. That said, I know very little
1. The construction of the basic building blocks that establish the estimation of KL divergence and of the entropy is well organized and clearly written. 2. It’s interesting to see the SDE framework of diffusion models being used under the setting of MI estimation, which could inspire the research community to investigate diffusion models in new directions.
1. While the utilization of score-based diffusion models can be justified by the Girsanov Theorem, it’s unclear how they are used as **generative models** (*i.e.*, using the reverse-time SDE to generate samples) — it seems that only forward diffusion SDEs are needed, in order to train the score networks. Therefore, it’s a bit confusing when the authors wrote “we explore the problem of estimating MI using generative models” (Page 1), instead of something like “we explore the problem of estimating
The problem considered is of critical importance in several applied and theoretical fields. Existing estimators either fail in high dimensions or require large amounts of data to provide precise estimates. The results are quite impressive, the proposed estimator seems to outperform alternatives in most settings. Several aspects of MI estimation that make the estimation challenging that were originally introduced in [1] such as sparsity, dimensionality, long tails, transformations, data process
The organization of the paper makes it hard to follow. The measure theoretical notations make the paper inaccessible to the broader audience interested in using the estimator in applied settings. The contributions are not fully clear. The connections between score, KL, MI, and H existed before. In addition, it's an established fact that diffusion process models are more powerful density estimators specifically in higher dimensions making it less surprising that the MI and H estimators are super
Taxonomy
Topics: Neural Networks and Applications · Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques
Methods: Diffusion
