On the Universality of the Logistic Loss Function
Amichai Painsky, Gregory W. Wornell

TL;DR
This paper shows that, for binary classification, the divergence associated with any smooth, proper, convex loss function is bounded from above by the KL divergence (up to a multiplicative normalization constant), which justifies the widespread use of log-loss across machine learning models.
Contribution
It establishes that minimizing log-loss also minimizes an upper bound on the divergence of any other smooth, proper, convex loss function, providing a theoretical foundation for its broad application.
Findings
KL divergence bounds other convex loss divergences from above
Log-loss minimizes an upper bound to various loss functions
Introduces new divergence inequalities similar to Pinsker's inequality
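The findings above can be written compactly as follows. The notation here ($\ell$, $L$, $D_\ell$, $C(\ell)$) is an assumption made for illustration and may differ from the paper's; the exact form of the normalization constant is not given in this summary and is left abstract. The last line is the classical Pinsker inequality for Bernoulli distributions (KL in nats), included only as the familiar point of comparison.

```latex
% Expected loss of predicting q when the true probability of y = 1 is p
L(p, q) = p\,\ell(1, q) + (1 - p)\,\ell(0, q)

% Divergence induced by a proper loss (non-negative, zero at q = p)
D_\ell(p \,\|\, q) = L(p, q) - L(p, p) \;\ge\; 0

% For log-loss, the induced divergence is the KL divergence
D_{\mathrm{KL}}(p \,\|\, q) = p \log\frac{p}{q} + (1 - p)\log\frac{1 - p}{1 - q}

% Shape of the bound described above, with C(\ell) a loss-dependent constant
D_\ell(p \,\|\, q) \;\le\; C(\ell)\, D_{\mathrm{KL}}(p \,\|\, q)

% Classical Pinsker inequality for Bernoulli(p) vs. Bernoulli(q)
D_{\mathrm{KL}}(p \,\|\, q) \;\ge\; 2\,(p - q)^2
```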
Abstract
A loss function measures the discrepancy between the true values (observations) and their estimated fits, for a given instance of data. A loss function is said to be proper (unbiased, Fisher consistent) if the fits are defined over a unit simplex, and the minimizer of the expected loss is the true underlying probability of the data. Typical examples are the zero-one loss, the quadratic loss and the Bernoulli log-likelihood loss (log-loss). In this work we show that for binary classification problems, the divergence associated with smooth, proper and convex loss functions is bounded from above by the Kullback-Leibler (KL) divergence, up to a multiplicative normalization constant. This implies that by minimizing the log-loss (associated with the KL divergence), we minimize an upper bound on the divergence of any loss function from this set. This property justifies the broad use of log-loss in…
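As a small numerical illustration of the kind of bound the abstract describes, the sketch below compares the divergence induced by the quadratic loss with the KL divergence on a grid of Bernoulli parameters. It is not the paper's method or its general constant: the helper names (`expected_loss`, `divergence`) are hypothetical, and the constant 1/2 used here is the one that follows from the classical Pinsker inequality for this particular loss, with KL measured in nats.

```python
import math

def expected_loss(loss, p, q):
    """Expected loss of predicting q when y = 1 occurs with probability p."""
    return p * loss(1, q) + (1 - p) * loss(0, q)

def divergence(loss, p, q):
    """Excess expected loss of predicting q instead of the true p."""
    return expected_loss(loss, p, q) - expected_loss(loss, p, p)

def log_loss(y, q):
    # Bernoulli negative log-likelihood (natural log); its divergence is KL.
    return -math.log(q) if y == 1 else -math.log(1 - q)

def quadratic_loss(y, q):
    # Squared error on the predicted probability; its divergence is (p - q)^2.
    return (y - q) ** 2

# Grid over the open interval (0, 1), avoiding the endpoints where log diverges.
grid = [i / 100 for i in range(1, 100)]

# Largest observed ratio of quadratic divergence to KL divergence.
worst_ratio = max(
    divergence(quadratic_loss, p, q) / divergence(log_loss, p, q)
    for p in grid
    for q in grid
    if p != q
)

print(f"max quadratic/KL divergence ratio on grid: {worst_ratio:.4f}")
```

On this grid the ratio stays below 1/2, consistent with Pinsker's inequality; other smooth, proper, convex losses would need their own constant, which this sketch does not attempt to compute.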
