Optimal Posteriors for Chi-squared Divergence based PAC-Bayesian Bounds and Comparison with KL-divergence based Optimal Posteriors and Cross-Validation Procedure

Puja Sahu; Nandyala Hemachandra

arXiv:2008.07330·math.ST·August 18, 2020·1 cites

Open Access

TL;DR

This paper compares chi-squared divergence based PAC-Bayesian bounds with KL-divergence based bounds, deriving optimal posteriors, analyzing their properties, and evaluating their performance on classifiers, highlighting differences in bounds, test errors, and computational efficiency.

Contribution

It introduces methods to compute optimal posteriors for chi-squared divergence bounds, compares them with KL-divergence posteriors, and assesses their practical performance and computational aspects.

Findings

01. Chi-squared divergence based posteriors yield weaker bounds and worse test errors than KL-divergence based posteriors.

02. KL-divergence based posteriors perform better on test error.

03. The proposed fixed point equations enable fast computation of optimal posteriors.

Abstract

We investigate optimal posteriors for the recently introduced chi-squared divergence based PAC-Bayesian bounds (Bégin et al., 2016) in terms of the nature of their distribution, scalability of computations, and test set performance. For a finite classifier set, we deduce bounds for three distance functions: KL-divergence, linear and squared distances. Optimal posterior weights are proportional to deviations of empirical risks, usually with subset support. For a uniform prior, it is sufficient to search among posteriors on classifier subsets ordered by these risks. We show that bound minimization for the linear distance is a convex program and obtain a closed-form expression for its optimal posterior, whereas that for the squared distance is a quasi-convex program under a specific condition, and the one for KL-divergence is a non-convex optimization problem (a difference of convex functions). To compute such optimal…
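As a rough illustration of the quantities the abstract refers to, the sketch below computes the chi-squared and KL divergences between a posterior and a prior on a finite classifier set, and enumerates the candidate supports obtained by ordering classifiers by empirical risk. The function names (`chi2_divergence`, `kl_divergence`, `prefix_supports`, `uniform_on`) and the risk values are hypothetical, and the uniform weights placed on each subset are an assumption made only to keep the example small; they are not the paper's optimal posterior weights, and no PAC-Bayesian bound is evaluated here.

```python
import math


def chi2_divergence(q, p):
    """chi^2(Q||P) = sum_i q_i^2 / p_i - 1 for finite distributions (p_i > 0)."""
    return sum(qi * qi / pi for qi, pi in zip(q, p)) - 1.0


def kl_divergence(q, p):
    """KL(Q||P) = sum_i q_i * log(q_i / p_i), using the convention 0 * log 0 = 0."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def prefix_supports(emp_risks):
    """Yield index subsets made of the k lowest-risk classifiers, k = 1..H."""
    order = sorted(range(len(emp_risks)), key=lambda i: emp_risks[i])
    for k in range(1, len(order) + 1):
        yield order[:k]


def uniform_on(support, n):
    """Illustrative posterior: equal weight on `support`, zero elsewhere.

    Assumption for the sketch only; the paper's optimal weights are not uniform.
    """
    q = [0.0] * n
    for i in support:
        q[i] = 1.0 / len(support)
    return q


if __name__ == "__main__":
    emp_risks = [0.30, 0.10, 0.20, 0.25]  # hypothetical empirical risks
    n = len(emp_risks)
    prior = [1.0 / n] * n  # uniform prior over the classifier set
    for support in prefix_supports(emp_risks):
        q = uniform_on(support, n)
        print(support, chi2_divergence(q, prior), kl_divergence(q, prior))
```

With a uniform prior over H classifiers and equal weights on a subset of size k, these formulas reduce to chi-squared divergence H/k - 1 and KL divergence log(H/k), which gives a quick sanity check on the implementation.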

Taxonomy

Topics: Machine Learning and Algorithms · Sparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques

Methods: Support Vector Machine