Cauchy-Schwarz Divergence Information Bottleneck for Regression
Shujian Yu, Xi Yu, Sigurd Løkse, Robert Jenssen, Jose C. Principe

TL;DR
This paper introduces a novel Cauchy-Schwarz divergence-based information bottleneck method for regression, which improves estimation, generalization, and robustness without relying on variational inference or distributional assumptions.
Contribution
It develops a new IB framework for regression using Cauchy-Schwarz divergence, avoiding variational approximations and enhancing robustness and performance.
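For orientation, the two quantities named above can be written out. This is a sketch in assumed notation ($\mathbf{x}$ for the input, $y$ for the target, $\mathbf{t}$ for the learned representation, $\beta$ for the trade-off weight), using the standard IB Lagrangian and the standard definition of the Cauchy-Schwarz divergence. The paper's exact objective and estimators may differ in detail.

```latex
% Standard IB Lagrangian: beta > 0 balances prediction against compression
\min_{p(\mathbf{t}\mid\mathbf{x})} \; -I(y;\mathbf{t}) + \beta\, I(\mathbf{x};\mathbf{t})

% Cauchy-Schwarz divergence between two densities p and q
D_{\mathrm{CS}}(p;q) = -\log
  \frac{\left(\int p(\mathbf{x})\,q(\mathbf{x})\,d\mathbf{x}\right)^{2}}
       {\int p(\mathbf{x})^{2}\,d\mathbf{x}\;\int q(\mathbf{x})^{2}\,d\mathbf{x}}
```

By the Cauchy-Schwarz inequality the ratio inside the logarithm is at most 1, so $D_{\mathrm{CS}}(p;q) \ge 0$, with equality exactly when $p = q$ almost everywhere.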
Findings
Outperforms existing deep IB methods on six real-world regression tasks.
Achieves the best trade-off between prediction accuracy and compression.
Provides strong adversarial robustness guarantees.
Abstract
The information bottleneck (IB) approach is popular for improving the generalization, robustness and explainability of deep neural networks. Essentially, it aims to find a minimum sufficient representation $\mathbf{t}$ by striking a trade-off between a compression term $I(\mathbf{x};\mathbf{t})$ and a prediction term $I(y;\mathbf{t})$, where $I(\cdot;\cdot)$ refers to the mutual information (MI). In the IB, MI is for the most part expressed in terms of the Kullback-Leibler (KL) divergence, which in the regression case corresponds to prediction based on mean squared error (MSE) loss with Gaussian assumption and compression approximated by variational inference. In this paper, we study the IB principle for the regression problem and develop a new way to parameterize the IB with deep neural networks by exploiting favorable properties of the Cauchy-Schwarz (CS) divergence. By doing so, we move…
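To make the CS divergence concrete, here is a minimal NumPy sketch of a plug-in, kernel-density-based estimate of the CS divergence between two sample sets. It illustrates only the basic two-sample quantity, not the paper's full regression objective (which involves conditional and mutual-information variants). The function names `gaussian_gram` and `cs_divergence`, the shared kernel width `sigma`, and the toy data are assumptions made for illustration.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Gaussian kernel matrix between rows of a (n, d) and b (m, d)."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))

def cs_divergence(x, y, sigma=1.0):
    """Plug-in estimate: log mean(Kxx) + log mean(Kyy) - 2 log mean(Kxy).

    Assumes the same kernel width for both sample sets, so the kernel
    normalisation constants cancel and can be omitted.
    """
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    return np.log(kxx) + np.log(kyy) - 2.0 * np.log(kxy)

# Toy data (hypothetical): two 2-D Gaussian clouds with different means.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))
y = rng.normal(2.0, 1.0, size=(200, 2))

d_same = cs_divergence(x, x, sigma=1.0)  # identical samples
d_diff = cs_divergence(x, y, sigma=1.0)  # shifted samples
```

The estimate is zero (up to round-off) for identical sample sets and positive for the shifted pair. Its value depends on the kernel width `sigma`, which has to be chosen by the user.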
Peer Reviews
Decision: ICLR 2024 poster
- The authors show the connections between CS divergence to MMD and HSIC.
- The effect of CS divergence to generalization and adversarial robustness is well quantified.
- Thorough discussions are provided for most of remarks or theoretic findings.

- The CS divergence estimation is based on the Gaussian kernel assumption, which will depend on the parameter $\sigma$. What is the effect of $\sigma$ to the IB performance is not shown clearly.
- The KL IB using variational approach is friendly to optimization based methods (gradient-based approaches). On the other hand, CS IB method is based on Gaussian kernel assumption, which may require the tuning of $\sigma$.
- I think it’s better to have a section of identifying what are some potential
The use of Cauchy-Schwarz divergence in information bottleneck approaches is reasonable and novel to my understanding. The authors derived an efficient algorithm for training IB approaches based on the Cauchy-Schwarz divergence. The authors demonstrate visible improvement over existing approaches.
The improvement over existing approaches is fairly limited. In many tasks, the improvement is as small as 0.1 RMSE (where the relative improvement is close to 0.01~0.04). I am not sure if the limited improvement on these datasets is particularly meaningful. It remains conceptually unclear to me why we want to use the Cauchy-Schwarz divergence. It is shown to be always <= KL divergence, but it is not clear how much smaller would this be (maybe only minimally).
- They introduced a new choice of loss function based on CS divergence for the IB method, which was often used with MSE based on Gaussian settings or MAE loss based on Laplace distribution.
- They pointed out the challenges in estimating MI and the adoption of indirect methods, such as estimating upper bounds. They proposed an IB framework based on direct estimation by performing non-parametric estimation using KDE.
- They pointed out that existing methods adopt indirect methods for MI estimatio
I would like to express my sincere respect for all the efforts the authors have invested in this paper. Unfortunately, however, I cannot strongly recommend this paper as an ICLR 2024 accepted paper for the following reasons: (1) a misalignment between the claims of contribution, the assumptions of theoretical analysis, and the content of theoretical analysis; (2) a lack of theoretical guarantees on the properties of the proposed estimations, and the unclear discussion of the pros and cons betwee
Taxonomy
Topics: Neural Networks and Applications
