When is the mode functional the Bayes classifier?

Tilmann Gneiting

arXiv:1704.08979·math.ST·May 1, 2017

TL;DR

This paper examines when the mode of the conditional probability distribution coincides with the Bayes classifier, and shows that the two differ under every cost structure other than zero-one loss.

Contribution

It shows that the mode functional is the Bayes classifier under zero-one loss and fails to be the Bayes classifier under any other cost structure.

Findings

01. Mode equals Bayes classifier under zero-one loss.

02. Mode fails to be optimal under other cost structures.

03. Provides theoretical insights into classification decision rules.

Abstract

In classification problems, the mode of the conditional probability distribution, i.e., the most probable category, is the Bayes classifier under zero-one or misclassification loss. Under any other cost structure, the mode fails to persist.

Equations

$$p(i \mid x) = \mathrm{pr}(Y = i \mid X = x) \quad \text{for} \quad i = 1, \ldots, k,$$

$$\bar{G}(x) = i \quad \text{if} \quad p(i \mid x) = \max_{i' = 1, \ldots, k} p(i' \mid x),$$

$$\hat{G}(x) = i \quad \text{if} \quad \sum_{j=1}^{k} L(i,j) \, p(j \mid x) = \min_{i' = 1, \ldots, k} \sum_{j=1}^{k} L(i',j) \, p(j \mid x).$$

$$\begin{pmatrix} 0 & c \\ 2-c & 0 \end{pmatrix},$$

$$\begin{pmatrix} 0 & a & b \\ a & 0 & 3-a-b \\ b & 3-a-b & 0 \end{pmatrix},$$

$$2b \, p(1 \mid x) \geq (2a-3) \, p(2 \mid x) + b, \qquad 2a \, p(1 \mid x) \geq (2b-3) \, p(3 \mid x) + a;$$

$$\begin{pmatrix} 0 & c & c \\ c & 0 & c \\ c & c & 0 \end{pmatrix},$$

Full text

When is the mode functional the Bayes classifier?

Tilmann Gneiting

Heidelberg Institute for Theoretical Studies and Karlsruhe Institute of Technology

(March 17, 2024)

Consider a finite number of categories or classes labeled $1,\ldots,k$ . Let the random variable $Y$ denote the class label, and let $X$ be a random covariate or feature vector, for a unit at hand. A probabilistic classifier is a conditional probability distribution,

$$p(i \mid x) = \mathrm{pr}(Y = i \mid X = x) \quad \text{for} \quad i = 1, \ldots, k,$$

which in typical practice is estimated from training data. In contrast, a deterministic classifier or decision rule $G(x)$ assigns a single class label to any realized feature.

The most common way of converting a probabilistic classifier into a decision rule is to use the mode functional $\bar{G}$ , which assigns the most probable class label, i.e.,

$$\bar{G}(x) = i \quad \text{if} \quad p(i \mid x) = \max_{i' = 1, \ldots, k} p(i' \mid x), \tag{1}$$

where ties are resolved by randomization. If $L(i,j)$ denotes the loss or cost when $G(x)=i$ and class $j$ realizes, where $i,j=1,\ldots,k$ , the associated Bayes classifier or optimal decision rule $\hat{G}$ assigns the class that minimizes the expected loss, i.e.,

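$$\hat{G}(x) = i \quad \text{if} \quad \sum_{j=1}^{k} L(i,j) \, p(j \mid x) = \min_{i' = 1, \ldots, k} \sum_{j=1}^{k} L(i',j) \, p(j \mid x). \tag{2}$$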

The literature typically studies classification problems under zero-one or misclassification loss, where $L(i,j)=0$ if $i=j$ and $L(i,j)=1$ if $i\not=j$ , and it is well known that the mode functional is the associated Bayes classifier (Hastie et al. 2009, p. 21).
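The two decision rules can be sketched in a few lines of Python. This is a minimal illustration rather than anything from the paper: the function names and the probability vector `p` are assumptions, and ties go to the lowest label instead of being randomized.

```python
import numpy as np

def mode_classifier(p):
    """Most probable class (1-based). Ties go to the lowest label here,
    whereas the text resolves ties by randomization."""
    return int(np.argmax(p)) + 1

def bayes_classifier(p, L):
    """Class (1-based) minimizing the expected loss sum_j L[i, j] * p[j]."""
    expected_loss = np.asarray(L, dtype=float) @ np.asarray(p, dtype=float)
    return int(np.argmin(expected_loss)) + 1

# Illustrative conditional probabilities p(1|x), p(2|x), p(3|x) (assumed values).
p = np.array([0.2, 0.5, 0.3])

# Zero-one loss: 0 on the diagonal, 1 elsewhere.
zero_one = np.ones((3, 3)) - np.eye(3)

mode_label = mode_classifier(p)
bayes_label = bayes_classifier(p, zero_one)
print(mode_label, bayes_label)
```

Under zero-one loss the expected loss of decision $i$ is $1 - p(i\mid x)$, so minimizing it is the same as maximizing $p(i\mid x)$, and both functions return class 2 for this `p`.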

However, it has been argued that misclassification loss “is rarely what users of classification methods really want” (Hand 1997, p. 7) and that one would “be hard pressed to find an application in which the costs of different kinds of errors were the same” (Witten et al. 2011, p. 164). Witten et al. (2011, p. 167) further note that under cost structures other than misclassification loss the Bayes classifier “might be different” from the mode. As we now show, it will in fact be different, in the sense that under any other loss structure the mode fails to persist.

To demonstrate this, we invoke the reasonableness condition of Elkan (2001) and assume that $L(i,j)\geq L(i,i)$ for $i,j=1,\ldots,k$ , with at least one of the inequalities being strict. Adding constants columnwise concerns costs that depend on the outcome only, and multiplying all entries of the loss matrix by a positive number merely changes the monetary unit. Therefore, we may restrict attention to loss or cost matrices for which $L(i,i)=0$ , $L(i,j)\geq 0$ and $\sum_{i\not=j}L(i,j)=k(k-1)$ . In other words, the diagonal elements vanish, and the off-diagonal entries are nonnegative and average to one.
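The normalization can be illustrated with a short Python sketch. The helper `normalize_loss` and the matrix `L_raw` are hypothetical; the example assumes that each diagonal entry is the smallest entry in its column, so that the shifted matrix is nonnegative.

```python
import numpy as np

def normalize_loss(L):
    """Shift each column so the diagonal vanishes, then rescale so the
    off-diagonal entries sum to k * (k - 1), i.e. average to one."""
    L = np.asarray(L, dtype=float)
    k = L.shape[0]
    # Subtract L[j, j] from every entry of column j (outcome-only cost).
    shifted = L - np.diag(L)[np.newaxis, :]
    # The diagonal is now zero, so the total is the off-diagonal sum.
    return shifted * (k * (k - 1) / shifted.sum())

# Hypothetical raw cost matrix: rows are decisions, columns are outcomes.
L_raw = np.array([[1.0, 4.0, 6.0],
                  [3.0, 2.0, 5.0],
                  [5.0, 6.0, 3.0]])
normalized = normalize_loss(L_raw)

# Illustrative probabilities (assumed values).
p = np.array([0.2, 0.5, 0.3])

# The minimizer of the expected loss is unchanged by the normalization.
raw_choice = int(np.argmin(L_raw @ p)) + 1
normalized_choice = int(np.argmin(normalized @ p)) + 1
print(raw_choice, normalized_choice)
```

The columnwise shift lowers every decision's expected loss by the same amount, $\sum_j L(j,j)\,p(j\mid x)$, and the rescaling multiplies all expected losses by one positive factor, so the optimal decision is the same before and after.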

In the binary case $k=2$ we thus consider cost matrices of the form

$$\begin{pmatrix} 0 & c \\ 2-c & 0 \end{pmatrix},$$

where $0\leq c\leq 2$ . The optimal decision is $\hat{G}(x)=1$ if $p(1\mid x)\geq c/2$ . If $c<1$ and $c/2\leq p(1\mid x)<1/2$ , the mode fails to be optimal; if $c>1$ , an analogous argument applies. Hence the mode functional is the Bayes classifier under zero-one loss only.
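A small numerical check makes the binary case concrete. The values `c = 0.6` and `p1 = 0.4` are illustrative assumptions chosen so that $c/2 \leq p(1\mid x) < 1/2$; the function names are hypothetical.

```python
import numpy as np

def binary_bayes(p1, c):
    """Bayes decision (1 or 2) under the cost matrix [[0, c], [2 - c, 0]]."""
    L = np.array([[0.0, c], [2.0 - c, 0.0]])
    p = np.array([p1, 1.0 - p1])
    return int(np.argmin(L @ p)) + 1

def binary_mode(p1):
    """Most probable class; a tie at p1 = 1/2 goes to class 1 here."""
    return 1 if p1 >= 0.5 else 2

c = 0.6    # assumed cost with c < 1
p1 = 0.4   # lies in [c/2, 1/2) = [0.3, 0.5)

bayes_under_c = binary_bayes(p1, c)           # expected losses 0.36 vs 0.56
mode_label = binary_mode(p1)
bayes_under_zero_one = binary_bayes(p1, 1.0)  # expected losses 0.60 vs 0.40
print(bayes_under_c, mode_label, bayes_under_zero_one)
```

With these values the Bayes decision under `c = 0.6` is class 1 while the mode is class 2; setting `c = 1` restores agreement with the mode.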

In the ternary case $k=3$ we may restrict attention to symmetric cost matrices of the form

$$\begin{pmatrix} 0 & a & b \\ a & 0 & 3-a-b \\ b & 3-a-b & 0 \end{pmatrix}, \tag{3}$$

where $0\leq a\leq 3$ and $0\leq b\leq 3-a$ , for if the cost matrix is asymmetric, the above arguments apply to a principal submatrix. The optimal decision under the cost matrix (3) is $\hat{G}(x)=1$ if

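$$2b \, p(1 \mid x) \geq (2a-3) \, p(2 \mid x) + b, \qquad 2a \, p(1 \mid x) \geq (2b-3) \, p(3 \mid x) + a;$$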

other cases are handled analogously. If $a=b=1$ , we recover zero-one loss, and the inequalities reduce to the conditions for the mode. Else, they yield functionals other than the mode.
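The two inequalities follow from comparing the expected loss of decision 1 with that of decisions 3 and 2, respectively, and eliminating one probability via $p(1\mid x)+p(2\mid x)+p(3\mid x)=1$. The sketch below checks them against a direct expected-loss computation; the function names and the parameter values are assumptions chosen for illustration.

```python
import numpy as np

def ternary_cost(a, b):
    """Symmetric ternary cost matrix with off-diagonal entries a, b, 3 - a - b."""
    c = 3.0 - a - b
    return np.array([[0.0, a, b], [a, 0.0, c], [b, c, 0.0]])

def class1_by_inequalities(p, a, b):
    """Check the two inequalities under which decision 1 is optimal."""
    p1, p2, p3 = p
    first = 2 * b * p1 >= (2 * a - 3) * p2 + b    # decision 1 versus 3
    second = 2 * a * p1 >= (2 * b - 3) * p3 + a   # decision 1 versus 2
    return bool(first and second)

def class1_by_expected_loss(p, a, b):
    """Check directly whether decision 1 attains the minimal expected loss."""
    losses = ternary_cost(a, b) @ np.asarray(p, dtype=float)
    return bool(losses[0] <= losses.min())

# Zero-one loss (a = b = 1): class 1 is the mode and is optimal.
p_zero_one = (0.5, 0.3, 0.2)
agree_zero_one = (class1_by_inequalities(p_zero_one, 1.0, 1.0),
                  class1_by_expected_loss(p_zero_one, 1.0, 1.0))

# Assumed costs a = 2, b = 0.5: the mode is class 2, the Bayes decision is class 3.
p_other = np.array([0.3, 0.4, 0.3])
losses_other = ternary_cost(2.0, 0.5) @ p_other
mode_other = int(np.argmax(p_other)) + 1
bayes_other = int(np.argmin(losses_other)) + 1
print(agree_zero_one, mode_other, bayes_other)
```

In the second example the expected losses of decisions 1, 2, and 3 are 0.95, 0.75, and 0.35, so the most probable class is not the loss-minimizing one.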

When $k\geq 4$ we see from the ternary case that a necessary condition for the cost matrix to yield the mode functional as Bayes classifier is that every $3\times 3$ principal submatrix be of the form

$$\begin{pmatrix} 0 & c & c \\ c & 0 & c \\ c & c & 0 \end{pmatrix},$$

where $0\leq c\leq k(k-1)/6$ . Considering successive principal submatrices, and iterating the argument, we see that in fact $c=1$ . Hence, the mode functional (1) is the Bayes classifier (2) under zero-one loss only, subject to the above assumptions.

This result complements findings by Heinrich (2014) in the case of a continuous outcome, where it is not possible to find a loss function under which the mode functional is the Bayes predictor. In the discrete setting considered here, the rife failure of the most probable value to minimize expected loss may urge practitioners to work with probabilistic classifiers in lieu of deterministic decision rules, as advocated powerfully by Harrell (2015, Section 1.3).

Acknowledgements

This work was funded by the European Union Seventh Framework Programme under grant agreement 290976. The author is grateful for infrastructural support by the Klaus Tschira Foundation and thanks Werner Ehm, Alexander Jordan and Michael Strube for instructive discussions.

References

Elkan, C. (2001). The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Francisco, pp. 973–978.

Hand, D. J. (1997). Construction and Assessment of Classification Rules. Wiley, Chichester.

Harrell, F. E. (2015). Regression Modeling Strategies, 2nd edition. Springer, Cham.

Hastie, T., Tibshirani, R. and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edition. Springer, New York.

Heinrich, C. (2014). The mode functional is not elicitable. Biometrika 101, 245–251.

Witten, I. H., Frank, E. and Hall, M. A. (2011). Data Mining, 3rd edition. Elsevier Morgan Kaufmann, Amsterdam.