Model Selection in Utility-Maximizing Binary Prediction

Jiun-Hua Su

arXiv:1903.00716·econ.EM·September 29, 2021

TL;DR

This paper introduces a utility-maximizing prediction rule (UMPR) that reduces overfitting in binary classification, providing theoretical bounds and demonstrating improved utility over standard estimators under misspecification.

Contribution

It develops a new UMPR method with theoretical guarantees to mitigate overfitting in utility-based binary prediction models.

Findings

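1. The UMPR achieves higher generalized expected utility than common estimators when the conditional probability of the outcome is misspecified.

2. Non-asymptotic upper bounds are established on the difference between the maximal expected utility and the generalized expected utility of the UMPR.

3. Simulation results show that the UMPR with an appropriate data-dependent penalty outperforms common estimators under misspecification.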

Abstract

The maximum utility estimation proposed by Elliott and Lieli (2013) can be viewed as cost-sensitive binary classification; thus, its in-sample overfitting issue is similar to that of perceptron learning. A utility-maximizing prediction rule (UMPR) is constructed to alleviate the in-sample overfitting of the maximum utility estimation. We establish non-asymptotic upper bounds on the difference between the maximal expected utility and the generalized expected utility of the UMPR. Simulation results show that the UMPR with an appropriate data-dependent penalty achieves larger generalized expected utility than common estimators in the binary classification if the conditional probability of the binary outcome is misspecified.

Tables

Table 1: Relative Generalized Expected Utility of ML, MU, and UMPR

n=500
DGP1	$p^{*} (x) = Λ (- 0.5 x + 0.2 x^{3})$
Preference	$b (x) = 20$ and $c (x) = 0.5$				$b (x) = 20$ and $c (x) = 0.5 + 0.025 x$
	$k = 1$	$k = 2$	$k = 3$		$k = 1$	$k = 2$	$k = 3$
ML	34.69	31.72	93.93		8.50	11.52	94.70
MU	51.05	55.33	67.15		33.44	45.56	58.40
	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$
UMPR (VC)	54.60	55.07	55.35	55.35	36.88	37.01	37.57	37.70
UMPR (MD)	58.93	59.77	60.71	60.83	47.64	49.59	51.07	51.10
	Cross-Validated $\hat{α}$		No Technical Term		Cross-Validated $\hat{α}$		No Technical Term
UMPR (VC)	54.72		56.51		37.12		41.32
UMPR (MD)	59.59		64.65		49.27		58.43
DGP2	$p^{*} (x_{1}, x_{2}) = Λ (Q (1.5 x_{1} + 1.5 x_{2}))$ where $Q (u) = \frac{1.5 - 0.1 u}{\exp {0.25 u + 0.1 u^{2} - 0.04 u^{3}}}$
Preference	$b (x_{1}, x_{2}) = 20$ and $c (x_{1}, x_{2}) = 0.75$				$b (x_{1}, x_{2}) = 20 + 40 \cdot 𝟙_{[\| x_{1} + x_{2} \| < 1.5]}$ and $c (x_{1}, x_{2}) = 0.75$
	$k = 1$	$k = 2$	$k = 3$		$k = 1$	$k = 2$	$k = 3$
ML	60.26	59.41	60.09		30.86	29.19	34.60
MU	67.44	51.78	68.14		49.33	33.10	51.87
	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$
UMPR (VC)	67.44	67.44	67.44	67.44	49.33	49.33	49.33	49.33
UMPR (MD)	68.28	68.25	68.36	68.34	49.30	49.29	49.48	49.47
	Cross-Validated $\hat{α}$		No Technical Term		Cross-Validated $\hat{α}$		No Technical Term
UMPR (VC)	67.44		67.44		49.33		49.33
UMPR (MD)	68.22		68.13		49.38		50.07
n=1000
DGP1	$p^{*} (x) = Λ (- 0.5 x + 0.2 x^{3})$
Preference	$b (x) = 20$ and $c (x) = 0.5$				$b (x) = 20$ and $c (x) = 0.5 + 0.025 x$
	$k = 1$	$k = 2$	$k = 3$		$k = 1$	$k = 2$	$k = 3$
ML	32.24	31.32	97.21		6.83	7.09	97.48
MU	54.88	58.48	69.94		35.17	48.03	60.04
	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$
UMPR (VC)	60.26	60.73	60.84	60.84	44.26	44.74	45.34	45.55
UMPR (MD)	64.07	64.74	65.19	65.47	55.20	56.11	57.68	57.78
	Cross-Validated $\hat{α}$		No Technical Term		Cross-Validated $\hat{α}$		No Technical Term
UMPR (VC)	60.40		62.26		44.30		49.30
UMPR (MD)	64.51		67.94		56.13		63.11
DGP2	$p^{*} (x_{1}, x_{2}) = Λ (Q (1.5 x_{1} + 1.5 x_{2}))$ where $Q (u) = \frac{1.5 - 0.1 u}{\exp {0.25 u + 0.1 u^{2} - 0.04 u^{3}}}$
Preference	$b (x_{1}, x_{2}) = 20$ and $c (x_{1}, x_{2}) = 0.75$				$b (x_{1}, x_{2}) = 20 + 40 \cdot 𝟙_{[\| x_{1} + x_{2} \| < 1.5]}$ and $c (x_{1}, x_{2}) = 0.75$
	$k = 1$	$k = 2$	$k = 3$		$k = 1$	$k = 2$	$k = 3$
ML	59.03	57.64	59.74		28.13	25.07	31.77
MU	69.87	55.43	71.19		52.89	39.38	56.19
	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$	$α = 1$	$α = 0.5$	$α = 0.1$	$α = 0.05$
UMPR (VC)	69.87	69.87	69.87	69.87	52.89	52.89	52.89	52.89
UMPR (MD)	70.46	70.50	70.54	70.55	53.02	53.15	54.03	54.07
	Cross-Validated $\hat{α}$		No Technical Term		Cross-Validated $\hat{α}$		No Technical Term
UMPR (VC)	69.87		69.87		52.89		52.89
UMPR (MD)	70.48		70.84		53.31		56.55

Table 2: Relative Generalized Expected Utility of UMPR, AIC, BIC, LASSO and SVM

n=500
DGP1	$p^{*} (x) = Λ (- 0.5 x + 0.2 x^{3})$
Preference	$b (x) = 20$ and $c (x) = 0.5$				$b (x) = 20$ and $c (x) = 0.5 + 0.025 x$
UMPR	MD	SMD	RC	BC	MD	SMD	RC	BC
	65.36	66.68	66.86	65.74	55.00	58.87	58.58	57.65
Information	AIC	BIC			AIC	BIC
Criterion	93.93	89.95			94.70	88.81
$ℓ_{1}$ -Penalty	LASSO	SVM			LASSO	SVM
	60.60	87.77			65.20	83.91
DGP2	$p^{*} (x_{1}, x_{2}) = Λ (Q (1.5 x_{1} + 1.5 x_{2}))$ where $Q (u) = \frac{1.5 - 0.1 u}{\exp {0.25 u + 0.1 u^{2} - 0.04 u^{3}}}$
Preference	$b (x_{1}, x_{2}) = 20$ and $c (x_{1}, x_{2}) = 0.75$				$b (x_{1}, x_{2}) = 20 + 40 \cdot 𝟙_{[\| x_{1} + x_{2} \| < 1.5]}$ and $c (x_{1}, x_{2}) = 0.75$
UMPR	MD	SMD	RC	BC	MD	SMD	RC	BC
	68.55	69.52	69.47	69.11	50.41	53.87	53.32	52.90
Information	AIC	BIC			AIC	BIC
Criterion	60.07	60.27			33.20	30.90
$ℓ_{1}$ -Penalty	LASSO	SVM			LASSO	SVM
	59.75	26.86			32.93	5.92
n=1000
DGP1	$p^{*} (x) = Λ (- 0.5 x + 0.2 x^{3})$
Preference	$b (x) = 20$ and $c (x) = 0.5$				$b (x) = 20$ and $c (x) = 0.5 + 0.025 x$
UMPR	MD	SMD	RC	BC	MD	SMD	RC	BC
	69.32	72.51	72.23	71.75	63.30	67.12	67.01	65.81
Information	AIC	BIC			AIC	BIC
Criterion	97.21	97.13			97.48	97.29
$ℓ_{1}$ -Penalty	LASSO	SVM			LASSO	SVM
	68.82	93.26			78.92	91.14
DGP2	$p^{*} (x_{1}, x_{2}) = Λ (Q (1.5 x_{1} + 1.5 x_{2}))$ where $Q (u) = \frac{1.5 - 0.1 u}{\exp {0.25 u + 0.1 u^{2} - 0.04 u^{3}}}$
Preference	$b (x_{1}, x_{2}) = 20$ and $c (x_{1}, x_{2}) = 0.75$				$b (x_{1}, x_{2}) = 20 + 40 \cdot 𝟙_{[\| x_{1} + x_{2} \| < 1.5]}$ and $c (x_{1}, x_{2}) = 0.75$
UMPR	MD	SMD	RC	BC	MD	SMD	RC	BC
	71.09	71.91	71.97	71.89	57.13	59.61	60.08	58.96
Information	AIC	BIC			AIC	BIC
Criterion	59.72	59.06			31.49	28.16
$ℓ_{1}$ -Penalty	LASSO	SVM			LASSO	SVM
	59.68	25.93			29.08	5.10

Equations

$\displaystyle \max_{a\in\{-1,1\}}\mathbb{E}[U(a,Y,X)\mid X=x].$

$\displaystyle a^{*}(x)\equiv\begin{cases}1,&\text{if}\;p^{*}(x)>c(x),\\ -1,&\text{otherwise},\end{cases}$

$\displaystyle c(x)\equiv\frac{u_{-1,-1}(x)-u_{1,-1}(x)}{u_{1,1}(x)-u_{-1,1}(x)+u_{-1,-1}(x)-u_{1,-1}(x)}$

$\displaystyle \max_{f}S(f)\equiv\mathbb{E}\bigl[b(X)[Y+1-2c(X)]\,\text{sign}(f(X)-c(X))\bigr],$

$\displaystyle b(x)\equiv u_{1,1}(x)-u_{-1,1}(x)+u_{-1,-1}(x)-u_{1,-1}(x)$

$\displaystyle \hat{f}\in\arg\max_{f\in\mathcal{F}}S_{n}(f)\equiv\frac{1}{n}\sum_{i=1}^{n}b(X_{i})[Y_{i}+1-2c(X_{i})]\,\text{sign}(f(X_{i})-c(X_{i})),$

$\displaystyle b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$

$\displaystyle \hat{f}\in\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}b(X_{i})[Y_{i}(1-2c(X_{i}))+1]\,\mathbbm{1}_{[Y_{i}\neq\text{sign}(f(X_{i})-c(X_{i}))]},$

$\displaystyle s(y,x,f)=b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$

$\displaystyle s(y,x,f)=\begin{cases}2(u_{1,1}(x)-u_{-1,1}(x))\,\text{sign}(f(x)-c(x)),&\text{if}\;y=1,\\ 2(u_{1,-1}(x)-u_{-1,-1}(x))\,\text{sign}(f(x)-c(x)),&\text{if}\;y=-1.\end{cases}$

$\displaystyle \mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\dots\subseteq\mathcal{F}_{k}\subseteq\cdots\quad\text{and}\quad\mathcal{F}\equiv\bigcup_{k=1}^{\infty}\mathcal{F}_{k}.$

$\displaystyle \hat{f}_{k}\in\arg\max_{f\in\mathcal{F}_{k}}S_{n}(f).$

$\displaystyle \tilde{S}_{n}(f;k,\alpha)\equiv S_{n}(f)-C_{n}(k;\alpha)$

$\displaystyle \tilde{f}_{n}(x;\alpha)\equiv\hat{f}_{\hat{k}_{n}(\alpha)}(x),\quad\text{where}\quad\hat{k}_{n}(\alpha)=\arg\max_{k\in\mathbb{N}}\tilde{S}_{n}(\hat{f}_{k};k,\alpha).$

$\displaystyle \tilde{f}_{n}^{IC}\equiv\hat{f}_{\check{k}_{n}}^{ML},\quad\text{where}\quad\check{k}_{n}=\arg\max_{k\in\mathbb{N}}\frac{1}{n}\sum_{i=1}^{n}L(\hat{f}_{k}^{ML}\mid Y_{i},X_{i})-C_{n}^{IC}(k).$

$\displaystyle \mathbb{P}\left(\sup_{f\in\mathcal{F}_{k}}(S_{n}(f)-S(f))>8M\sqrt{\frac{2\log\Pi_{k,c}(n)}{n}}+\varepsilon\right)\leq\exp\left\{-\frac{n\varepsilon^{2}}{32M^{2}}\right\},$

$\displaystyle \Pi_{\mathcal{H}}(\ell)=\max_{(x_{1},\dots,x_{\ell})\in\mathcal{X}^{\ell}}\bigl|\{(h(x_{1}),\dots,h(x_{\ell})):h\in\mathcal{H}\}\bigr|.$

$\displaystyle S_{n}(\hat{f}_{k})-S(\hat{f}_{k})\leq 8M\sqrt{\frac{2\log\Pi_{k,c}(n)}{n}}+8M\sqrt{\frac{\log(1/\delta)}{2n}}.$

$\displaystyle (Y_{1},\dots,Y_{\ell})^{\top}\neq(\text{sign}(f(x_{1})-c(x_{1})),\dots,\text{sign}(f(x_{\ell})-c(x_{\ell})))^{\top}.$

$\displaystyle \mathbb{P}\left(\sup_{f\in\mathcal{F}_{k}}|S_{n}(f)-S(f)|>\varepsilon\right)\leq 2\exp\left\{-\frac{n\varepsilon^{2}}{128M^{2}}\right\}.$

$\displaystyle \psi_{c}(k,n)=\begin{cases}2^{n},&\text{if}\;n\leq V_{k,c},\\ \left(\frac{en}{V_{k,c}}\right)^{V_{k,c}},&\text{if}\;n>V_{k,c},\end{cases}$

$\displaystyle \mathbb{P}\left(\sup_{f\in\mathcal{F}_{k}}(S_{n}(f)-S(f))>8M\sqrt{\frac{2\log\psi_{c}(k,n)}{n}}+\varepsilon\right)\leq\exp\left\{-\frac{n\varepsilon^{2}}{32M^{2}}\right\}.$

$\displaystyle R_{n,k}^{VC}\equiv S_{n}(\hat{f}_{k})-8M\sqrt{\frac{2\log\psi_{c}(k,n)}{n}}.$

$\displaystyle \mathbb{P}\left(R_{n,k}^{VC}-S(\hat{f}_{k})>\varepsilon\right)\leq\mathbb{P}\left(\sup_{f\in\mathcal{F}_{k}}(S_{n}(f)-S(f))>8M\sqrt{\frac{2\log\psi_{c}(k,n)}{n}}+\varepsilon\right)\leq\exp\left\{-\frac{n\varepsilon^{2}}{32M^{2}}\right\}$

$\displaystyle C_{n}^{VC}(k;\alpha)$

$\displaystyle \chi_{n}(k;\alpha)\equiv\sqrt{\frac{(1+\alpha)\log V_{k,c}}{2n}}.$

$\displaystyle \mathbb{P}\left(\tilde{S}_{n}(\tilde{f}_{n})-S(\tilde{f}_{n})>\varepsilon\right)\leq\zeta(\alpha_{0})\exp\left\{-\frac{n\varepsilon^{2}}{32M^{2}}\right\},$

$\displaystyle S^{*}-\mathbb{E}[S(\tilde{f}_{n})]\leq\min_{k}\left\{C_{n}^{VC}(k;\alpha_{0})+(S^{*}-S_{k}^{*})\right\}+8M\sqrt{\frac{1+\log\zeta(\alpha_{0})}{2n}},$

Taxonomy

Topics: Neural Networks and Applications · Imbalanced Data Classification Techniques · Statistical Methods and Inference

Full text

Model Selection in Utility-Maximizing Binary Prediction

Jiun-Hua Su

Institute of Economics, Academia Sinica

This paper is a revision of the third chapter of my dissertation. I am grateful to the co-editor, the associate editor, and two anonymous referees for constructive comments and suggestions. I also thank Peter Bartlett, Le-Yu Chen, Yu-Chin Hsu, Hsuan-Tien Lin, Demian Pouzo, James Powell, seminar participants at Erasmus University Rotterdam, and participants at the 2019 Australasian Meetings of the Econometric Society for helpful discussions. Address correspondence to Jiun-Hua Su, 128 Academia Road, Section 2, Nankang, Taipei, 115 Taiwan; E-mail address: [email protected].

Keywords: Decision-based binary prediction, Maximum utility estimation, Model selection, Structural risk minimization, Perceptron learning

JEL Classification: C14, C45, C52, C53

1 Introduction

Making a binary decision based on an uncertain binary outcome is common in modern economic activities. For instance, an investor who considers buying a financial instrument may predict its future price change and buy the instrument if the price is predicted to rise. As suggested by Granger and Machina (2006), decision-making based on the prediction of a binary outcome should be driven by the preference of the decision maker. On the one hand, the utility arising from a mismatch between the binary decision and the outcome may depend on the realized outcome; on the other hand, the utility may be affected by observable covariates. In making financial investment decisions, the disutility for an investor who buys the instrument but suffers a decrease in the price may be greater than that for an investor who does not buy the instrument but sees the price increase. In addition, features of the instrument, for example measures of its price volatility, may affect not only the likelihood of a price change but also the investor’s utility.¹

¹ Barberis and Xiong (2012) propose a model to explain the individual investor preference for volatile stocks.

Further examples illustrating the importance of the decision maker’s preference in economic forecasting are provided in Elliott and Timmermann (2016).

Although subjective preferences matter for decision-making based on binary prediction, traditional methods of pattern classification rarely take the decision maker’s utility into consideration. Elliott and Lieli (2013) propose a maximum utility approach that incorporates the decision maker’s utility into the prediction of a binary outcome $Y\in\{-1,1\}$ given a vector $X$ of observed covariates. Instead of globally estimating the conditional probability $p^{*}(x)\equiv\mathbb{P}(Y=1\mid X=x)$, they show that the utility-maximizing binary classification problem can be solved by estimating only the sign of $p^{*}(x)-c(x)$, where $c$ is the cutoff function determined by the decision maker’s utility function. Compared with maximum likelihood estimation, however, their maximum utility estimation is prone to in-sample overfitting.

In this paper, we show that the maximum utility estimation can be viewed as cost-sensitive binary classification; thus, its in-sample overfitting issue is similar to that of perceptron learning in the machine learning literature. To alleviate the tendency of fitting the in-sample noise by sophisticated models, we follow the structural risk minimization approach proposed by Vapnik (1982). More precisely, we pre-specify a hierarchy of classes of (finite-dimensional) functions and consider a utility-maximizing prediction rule (UMPR), which is a maximum utility estimator that maximizes a complexity-penalized empirical utility. To construct the UMPR, we consider a VC-type distribution-free complexity penalty and four data-dependent complexity penalties: maximal discrepancy (Bartlett et al. (2002)), simulated maximal discrepancy, Rademacher complexity (Koltchinskii (2001) and Bartlett et al. (2002) among others), and bootstrap complexity (Fromont (2007)). We evaluate the performance of a prediction rule by the generalized expected utility, that is, the expected utility achieved by the decision maker applying this prediction rule, which could be constructed based on in-sample observations, to the classification of an out-of-sample observation. We emphasize generalization because the expected utility averages over not only the out-of-sample observation but the ex ante in-sample observations as well. We prove that the difference between the maximal expected utility and the generalized expected utility of the UMPR can be bounded by an almost optimal trade-off between the expected complexity penalty and the approximation error, that is, an error due to the approximation of functions in a hierarchy of classes to an optimal decision rule. Hence, whenever the approximation error is equal to zero for some class of functions, the expected utility of the UMPR increases in the sample size and will asymptotically attain the maximal expected utility. 
In other words, the proposed UMPR is universally utility consistent.²

² Universal utility consistency has a counterpart in the literature on empirical risk minimization, in which different names are used, for example universal consistency in Devroye et al. (1996), persistence in Greenshtein and Ritov (2004) and Greenshtein (2006), and risk consistency in Homrighausen and McDonald (2013). Another conceptually different but common term is consistency, and in this paper, it refers to the property that a selection method asymptotically picks a model with the lowest Kullback-Leibler divergence, as in Sin and White (1996).

The idea of complexity penalization has been applied to selection methods in econometrics. The early literature is mainly concerned with penalized likelihood criteria rather than penalized empirical utility criteria. One strand of the literature adopts the information-theoretic approach. Classical examples include Akaike’s (1973) information criterion (AIC), Schwarz’s (1978) information criterion (BIC), and their cousins (TIC and GIC among others).³ Additionally, although motivated differently, leave-one-out cross-validation in the likelihood framework is asymptotically equivalent to the AIC, whereas the minimum description length can be approximated by the BIC. These early selection methods are well documented in Konishi and Kitagawa (2008) and Claeskens and Hjort (2008). With the purpose of achieving shrinkage and variable selection simultaneously, Tibshirani’s (1996) least absolute shrinkage and selection operator (LASSO) adopts an $\ell_{1}$ penalty. All of these selection methods can be applied to binary prediction in maximum likelihood estimation with the logit specification. As an alternative to penalized maximum likelihood estimation, penalized maximum score estimation has recently been applied to variable selection in binary prediction by Chen and Lee (2018b), who use an $\ell_{0}$ penalty.⁴ Furthermore, replacing the zero-one loss with the hinge loss, the $\ell_{1}$-norm support vector machine (SVM), developed by Zhu et al. (2004) and Fung and Mangasarian (2004) among others in the machine learning community, can effectively select variables in traditional binary classification, namely binary classification with a symmetric loss independent of covariates. However, none of the penalty-based selection methods above takes the decision maker’s utility into account.

³ Schwarz’s information criterion is also called BIC, as it is usually analyzed from a Bayesian angle. TIC and GIC are abbreviations for Takeuchi’s information criterion and generalized information criterion, respectively.

⁴ Similarly, by setting a bound on the $\ell_{0}$-norm of covariates, Chen and Lee (2018a) consider the constrained maximum score estimation to select covariates in binary prediction.

Despite the prevalence of penalty-based selection methods, pretesting and cross-validation are two alternatives in the literature. Both can be adapted for selection in maximum utility estimation. Elliott and Lieli (2013) propose a general-to-specific pretest to select MU estimators but do not investigate the theoretical properties of their post-model-selection MU estimator. As suggested by Leeb and Pötscher (2005, 2008), however, a post-model-selection estimator has complicated distributional properties, which are partly attributable to using the same data for both estimation and validation. In contrast, data splitting makes it convenient to evaluate the out-of-sample performance of a prediction rule, so cross-validation can be used to select models in almost any framework. As argued by Arlot and Celisse (2010), however, this wide applicability makes the predictive performance of cross-validation less satisfactory than that of selection methods tailored to a specific framework. The lack of theoretical analysis of the pretest and cross-validatory estimators motivates their evaluation by Monte Carlo experiments in this paper. According to the simulation results, the UMPR with an appropriate data-dependent penalty outperforms the pretest and cross-validatory estimators. It also outperforms the AIC, BIC, and LASSO if the conditional probability of the binary outcome is misspecified, and the $\ell_{1}$-norm SVM if the cutoff function deviates considerably from $1/2$.

The proposed method in this paper can be viewed as an essential complement to the literature concerned with model selection in binary classification. Although cost-sensitive binary classification is important for decision-making, previous studies on binary classification focus on performance evaluated by the cost-insensitive rate of misclassification. By applying McDiarmid’s (1989) inequality and Massart’s (2000) lemma, we extend the aforementioned complexity penalties to the maximum utility estimation, in which the cost of misclassification may depend on the binary outcome and covariates. Following the structural risk minimization approach, we establish non-asymptotic upper bounds on the difference between the maximal expected utility and the generalized expected utility of the UMPR. These bounds, similar to those in Koltchinskii (2001) and Bartlett et al. (2002) for the traditional binary classification, strike a balance between the approximation error and the expected complexity penalty. In addition, we establish a non-asymptotic upper bound, which shrinks to zero as the in-sample size tends to infinity, on the expected value of these data-dependent complexity penalties. We also extend properties of the Bayes decision rule to the maximum utility estimation. This extension not only confirms Elliott and Lieli’s (2013) insight that the knowledge of sign of $p^{*}(x)-c(x)$ suffices to achieve the maximal expected utility, but also implies that the approximation error is bounded by the uniform distance between the conditional probability $p^{*}$ and the underlying class of functions. Consequently, given pre-specified classes of functions, we can examine the universal utility consistency of UMPR, which is a property (corresponding to the risk consistency in loss minimization) rarely investigated in the previous studies on model selection in binary classification.

Throughout this paper, all random variables are defined on the probability space $(\Omega,\mathcal{A},\mathbb{P})$ . Data are assumed to be i.i.d. (independent and identically distributed). We write $\mathbb{N}$ for the collection of positive integers and $\mathbb{R}$ for the collection of real numbers. We denote the indicator function by $\mathbbm{1}_{[E]}$ , which equals one if event $E$ occurs and equals zero otherwise. We also denote the sign function by $\text{sign}(z)$ , which is equal to $2\mathbbm{1}_{[z\geq 0]}-1$ for any $z\in\mathbb{R}$ .

The remainder of the paper is organized as follows. Section 2 describes the maximum utility estimation in Elliott and Lieli’s (2013) model and the issue of its in-sample overfitting. Section 3 presents the construction of a utility-maximizing prediction rule based on different complexity penalties, together with non-asymptotic upper bounds on the difference between the generalized expected utility of the prediction rule and the maximal expected utility. In Section 4, Monte Carlo experiments evaluate the finite-sample performance of the proposed utility-maximizing prediction rule and the aforementioned estimators. Section 5 concludes. Technical proofs and the steps for implementing the pretest and cross-validation in maximum utility estimation are collected in the Appendix.

2 Maximum Utility Estimation

2.1 Model

We start by describing Elliott and Lieli’s (2013) model of binary decision-making based on binary prediction: Before the realization of a binary outcome $Y\in\{-1,1\}$ , a decision maker aims to choose a binary decision $a\in\{-1,1\}$ to maximize his or her expected utility conditional on a $d$ -dimensional vector of observed covariates $X=x\in\mathcal{X}$ . Concretely, the decision maker solves the optimization problem

$\displaystyle \max_{a\in\{-1,1\}}\mathbb{E}[U(a,Y,X)\mid X=x].\qquad(1)$

We abbreviate by writing $u_{a,y}(x)=U(a,y,x)$ for notational simplicity.

Elliott and Lieli show that under some regularity assumptions, we can obtain an optimal decision rule (after observing $X=x$ )

$\displaystyle a^{*}(x)\equiv\begin{cases}1,&\text{if}\;p^{*}(x)>c(x),\\ -1,&\text{otherwise},\end{cases}$

where

$\displaystyle c(x)\equiv\frac{u_{-1,-1}(x)-u_{1,-1}(x)}{u_{1,1}(x)-u_{-1,1}(x)+u_{-1,-1}(x)-u_{1,-1}(x)}$

is a cutoff function derived from the utility function, which is known in principle to the decision maker. Thus, knowledge of the correct conditional probability $p^{*}(x)$ yields the maximal expected utility of the decision.
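As a minimal sketch of the cutoff and the decision rule above, the following Python snippet computes $b(x)$ and $c(x)$ from the four utilities at a given covariate value and then applies $a^{*}$. The function names and the numerical utilities are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def cutoff_and_scale(u11, u_m11, u_m1m1, u1m1):
    """Return (b, c) from the utilities u_{a,y} at a fixed covariate value.

    Arguments are u_{1,1}, u_{-1,1}, u_{-1,-1}, u_{1,-1} (hypothetical names).
    """
    b = (u11 - u_m11) + (u_m1m1 - u1m1)   # denominator of the cutoff
    c = (u_m1m1 - u1m1) / b               # cutoff c(x)
    return b, c

def optimal_action(p_star, c):
    # a*(x) = 1 if p*(x) > c(x), and -1 otherwise
    return np.where(p_star > c, 1, -1)

# Made-up utilities: a correct "1" is worth 3, a correct "-1" is worth 1.
b, c = cutoff_and_scale(u11=3.0, u_m11=0.0, u_m1m1=1.0, u1m1=0.0)
print(b, c)                                                    # 4.0 0.25
print(optimal_action(np.array([0.1, 0.25, 0.6]), c).tolist())  # [-1, -1, 1]
```

With these made-up utilities the decision maker acts as soon as the conditional probability exceeds 0.25 rather than 0.5, which is the sense in which the cutoff encodes preferences.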

Elliott and Lieli’s (2013) insight is that correct specification of $\text{sign}(p^{*}(x)-c(x))$, rather than of $p^{*}(x)$ itself, is enough to achieve the maximal expected utility in (1); namely, knowledge of the crossing points between the conditional probability $p^{*}(x)$ and the cutoff $c(x)$ is sufficient. Moreover, they point out that the decision-making problem in (1) can be equivalently written as

$\displaystyle \max_{f}S(f)\equiv\mathbb{E}\bigl[b(X)[Y+1-2c(X)]\,\text{sign}(f(X)-c(X))\bigr],$

where

$\displaystyle b(x)\equiv u_{1,1}(x)-u_{-1,1}(x)+u_{-1,-1}(x)-u_{1,-1}(x)$

is the denominator of $c(x)$ and the maximum is taken over all measurable functions from $\mathcal{X}$ to $\mathbb{R}$ . Since $u_{a,y}(x)=0.25b(x)[y+1-2c(x)]a+0.25b(x)[y+1-2c(x)]+u_{-1,y}(x)$ , we normalize the utility function by setting $u_{-1,y}(x)=-0.25b(x)[y+1-2c(x)]$ for all $x\in\mathcal{X}$ and call $S(f)$ the expected utility of $f$ .

The insight motivates Elliott and Lieli (2013) to propose the maximum utility (MU) estimation. To be specific, given observations $\{(Y_{i},X_{i})\}_{i=1}^{n}$ with the sample size $n$ and a pre-specified class $\mathcal{F}$ of functions indexed by a finite-dimensional parameter, we can choose a measurable maximum utility estimator $\hat{f}\in\mathcal{F}$ that satisfies

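$\displaystyle \hat{f}\in\arg\max_{f\in\mathcal{F}}S_{n}(f)\equiv\frac{1}{n}\sum_{i=1}^{n}b(X_{i})[Y_{i}+1-2c(X_{i})]\,\text{sign}(f(X_{i})-c(X_{i})),\qquad(2)$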

where “$\arg$” stands for the set of estimators in $\mathcal{F}$ that achieve the optimum.⁵

⁵ Since multiplicity of the maximum utility estimator could be present, the analysis in this paper emphasizes the properties of optimand functions. See Elliott and Lieli (2013) for the discussion about the lack of identification of optimizers.
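To make the empirical criterion concrete, here is a small sketch of $S_{n}(f)$ and of a brute-force maximum utility estimator over a logistic linear-index class with a scalar covariate. The grid search, the toy data, and all function names are assumptions made for illustration; the paper does not prescribe this optimization routine.

```python
import itertools
import numpy as np

def sign(z):
    # Paper's convention: sign(z) = 2 * 1[z >= 0] - 1, so sign(0) = 1.
    return np.where(z >= 0, 1, -1)

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def empirical_utility(f_vals, y, b, c):
    # S_n(f) = (1/n) sum_i b(X_i) [Y_i + 1 - 2 c(X_i)] sign(f(X_i) - c(X_i))
    return float(np.mean(b * (y + 1 - 2 * c) * sign(f_vals - c)))

def mu_grid_search(x, y, b, c, grid):
    # Evaluate S_n on every (theta0, theta1) in the grid and keep the best one.
    def utility(theta):
        return empirical_utility(logistic(theta[0] + theta[1] * x), y, b, c)
    best = max(grid, key=utility)
    return best, utility(best)

# Toy sample (made up): outcomes switch from -1 to 1 at x = 0.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([-1, -1, 1, 1, 1])
grid = list(itertools.product([-1, 0, 1], repeat=2))
theta_hat, s_n = mu_grid_search(x, y, b=20.0, c=0.5, grid=grid)
print(theta_hat, s_n)  # (0, 1) 20.0
```

Because $S_{n}$ is a step function of the parameters, gradient-based optimizers do not apply directly; a grid is used here only because it is the simplest thing that works on a toy problem.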

The Monte Carlo simulation in Elliott and Lieli (2013) shows that maximum utility estimation, compared with traditional maximum likelihood approaches, achieves a large improvement in utility, especially when the conditional probability $p^{*}$ is misspecified. However, the in-sample performance of maximum utility estimation may be partly attributable to overfitting. Elliott and Lieli further make the following comment:

Both ML and MU have a strong tendency to overfit in sample, however the problem seems more severe for the MU method. This creates challenges for model selection.

2.2 Nature of the Overfitting in MU Estimation

The in-sample overfitting of maximum utility estimation is similar to that of perceptron learning in the machine learning literature. The simple perceptron learning is a method of binary pattern recognition that establishes classification based on a threshold function $f(x)=\text{sign}(\theta_{1}^{\top}x-\theta_{0})$ , where $\theta_{1}\in\mathbb{R}^{d}$ and $\theta_{0}\in\mathbb{R}$ . More variants of perceptron are well documented in Vapnik (2000) and Anthony and Bartlett (1999). Since it can be shown that for any $(y,x)\in\{-1,1\}\times\mathcal{X}$ ,

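$\displaystyle b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))=b(x)[y(1-2c(x))+1]\bigl(1-2\,\mathbbm{1}_{[y\neq\text{sign}(f(x)-c(x))]}\bigr),$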

the optimization problem in (2) can be viewed as the simple perceptron learning in which the cost of misclassification for each observation $(Y_{i},X_{i})$ may be different. To be specific, the maximum utility estimator satisfies

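$\displaystyle \hat{f}\in\arg\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}b(X_{i})[Y_{i}(1-2c(X_{i}))+1]\,\mathbbm{1}_{[Y_{i}\neq\text{sign}(f(X_{i})-c(X_{i}))]},$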

where $b(X_{i})[Y_{i}(1-2c(X_{i}))+1]$ is the cost of misclassification for the observation $(Y_{i},X_{i})$ , and the cost of misclassification is nonnegative under some mild assumptions. The simple perceptron learning is a special case of the maximum utility estimation because the cost of misclassification is identical for each observation whenever there is a constant $\bar{u}\in\mathbb{R}_{+}$ such that $u_{1,1}(x)-u_{-1,1}(x)=u_{-1,-1}(x)-u_{1,-1}(x)=\bar{u}$ for all $x\in\mathcal{X}$ . In this case, if $\mathcal{F}$ is further parameterized as $\{x\mapsto\theta_{0}+\theta_{1}^{\top}x:(\theta_{0},\theta_{1})^{\top}\in\mathbb{R}^{(d+1)}\}$ , the maximum utility estimation reduces to Manski’s (1975, 1985) maximum score estimation. Moreover, even though the cost of misclassification may be different for each observation $(Y_{i},X_{i})$ , when the in-sample observations (training data set) can be perfectly separated by $\mathcal{F}$ (i.e., classification without error), the maximum utility estimation boils down to the simple perceptron learning. This is because in this case, the cost of misclassification $b(X_{i})[Y_{i}(1-2c(X_{i}))+1]$ has no effect:

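$\displaystyle 0=\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}b(X_{i})[Y_{i}(1-2c(X_{i}))+1]\,\mathbbm{1}_{[Y_{i}\neq\text{sign}(f(X_{i})-c(X_{i}))]}=\min_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}_{[Y_{i}\neq\text{sign}(f(X_{i})-c(X_{i}))]}.$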

Although perfect separation of the in-sample observations could be accomplished by a sufficiently large class of functions, such sophisticated models will also fit the in-sample noise and thus worsen the out-of-sample performance.
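The cost-sensitive reading of maximum utility estimation can be checked numerically. The sketch below verifies, observation by observation, that the utility $b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$ equals the cost $b(x)[y(1-2c(x))+1]$ when the prediction matches $y$ and minus that cost otherwise, so maximizing $S_{n}$ is the same as minimizing the cost-weighted misclassification rate. The simulated data, the particular rule $f$, and the function names are assumptions chosen for illustration; the cutoff $c(x)=0.5+0.025x$ with $b(x)=20$ mirrors one of the preference specifications listed in Table 1.

```python
import numpy as np

def sign(z):
    # Paper's convention: sign(0) = 1.
    return np.where(z >= 0, 1, -1)

def check_cost_sensitive_identity(seed=0, n=1000):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                      # assumed covariate distribution
    y = np.where(rng.random(n) < 0.5, 1, -1)    # arbitrary labels
    b = 20.0
    c = 0.5 + 0.025 * x
    f_vals = 1.0 / (1.0 + np.exp(0.5 * x))      # an arbitrary prediction rule
    pred = sign(f_vals - c)

    utility = b * (y + 1 - 2 * c) * pred        # s(y, x, f)
    cost = b * (y * (1 - 2 * c) + 1)            # cost of misclassification
    mismatch = (y != pred)

    pointwise_gap = float(np.max(np.abs(utility - cost * (1 - 2 * mismatch))))
    s_n = float(utility.mean())
    s_n_from_costs = float(cost.mean() - 2 * np.mean(cost * mismatch))
    return pointwise_gap, s_n, s_n_from_costs

gap, s_n, s_n_from_costs = check_cost_sensitive_identity()
print(gap < 1e-9, abs(s_n - s_n_from_costs) < 1e-9)  # True True
```

Since the average cost does not depend on $f$, the two criteria differ only by a constant and a factor of $-2$, which is why they share the same optimizers.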

3 Model Selection

Motivated by possible in-sample overfitting, we adopt the structural risk minimization approach (also known as complexity regularization) in machine learning to investigate model selection in cost-sensitive binary classification. More precisely, our goal is to alleviate the overfitting by selecting a maximum utility estimator from some specific class of functions such that this selected maximum utility estimator, compared with maximum utility estimators from other classes of functions, has the largest complexity-penalized empirical utility.
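A minimal sketch of the selection step just described, assuming the maximized empirical utilities of the candidate classes have already been computed. All numbers are made up for illustration, the penalties are placeholders standing in for whichever complexity penalty is used, and `select_class` is a hypothetical helper name.

```python
def select_class(empirical_utils, penalties):
    """Pick the class whose estimator has the largest penalized empirical utility.

    empirical_utils[k-1] plays the role of the maximized empirical utility in
    class k, and penalties[k-1] the role of that class's complexity penalty.
    """
    penalized = [s - pen for s, pen in zip(empirical_utils, penalties)]
    k_hat = max(range(len(penalized)), key=penalized.__getitem__) + 1
    return k_hat, penalized

# Richer classes fit better in sample but pay a larger penalty.
k_hat, penalized = select_class(
    empirical_utils=[10.0, 12.0, 12.5],   # made-up values, increasing in k
    penalties=[0.5, 1.0, 2.0],            # made-up penalties, increasing in k
)
print(k_hat, penalized)  # 2 [9.5, 11.0, 10.5]
```

In this toy example the largest class has the best in-sample fit, yet the middle class is selected once complexity is charged for, which is the mechanism meant to curb overfitting.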

To explain the idea, we first introduce notation. Let

$\displaystyle s(y,x,f)=b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$

be the utility of the prediction rule $f$ evaluated at the observation $(y,x)$.⁶

⁶ More precisely, $s(y,x,f)$ is the double extra gain (loss) in utility arising from a match (mismatch) between the decision $\text{sign}(f(x)-c(x))$ and the outcome $y$ given the covariate $x$, because

$\displaystyle s(y,x,f)=\begin{cases}2(u_{1,1}(x)-u_{-1,1}(x))\,\text{sign}(f(x)-c(x)),&\text{if}\;y=1,\\ 2(u_{1,-1}(x)-u_{-1,-1}(x))\,\text{sign}(f(x)-c(x)),&\text{if}\;y=-1.\end{cases}$

A sample of i.i.d. observations with sample size $n$ is denoted by $\mathscr{D}_{n}\equiv\{(Y_{i},X_{i})\}_{i=1}^{n}$ . Given a prediction rule $\hat{f}$ constructed based on $\mathscr{D}_{n}$ , let $S(\hat{f})=\operatorname*{\mathbb{E}}[s(Y,X,\hat{f})\mid\mathscr{D}_{n}]$ and $S_{n}(\hat{f})=\frac{1}{n}\sum_{i=1}^{n}s(Y_{i},X_{i},\hat{f})$ be the expected utility and the empirical utility of the prediction rule $\hat{f}$ , respectively. The expectation involved in the definition of $S(\hat{f})$ is taken with respect to an observation $(Y,X)$ , which is independent of $\mathscr{D}_{n}$ and distributed as $(Y_{1},X_{1})$ . Said differently, $S(\hat{f})$ measures the decision maker’s expected utility if he or she uses the prediction rule $\hat{f}$ , estimated based on $\mathscr{D}_{n}$ , to classify one additional observation $(Y,X)$ drawn independently of $\mathscr{D}_{n}$ . Note that $S(\hat{f})$ could be random because of the random sample $\mathscr{D}_{n}$ . We suppress the possible dependence of $S(\hat{f})$ on $\mathscr{D}_{n}$ for convenience of exposition.
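The distinction between $S_{n}$ and $S$ can be illustrated by approximating $S(f)$ for a fixed rule with fresh draws that play the role of the additional observation $(Y,X)$. The sketch below uses the conditional probability labeled DGP1 in Table 1 with $b(x)=20$ and $c(x)=0.5$; the standard normal covariate, the number of draws, and the function names are assumptions, since this section does not state how $X$ is distributed.

```python
import numpy as np

def logistic(v):
    return 1.0 / (1.0 + np.exp(-v))

def sign(z):
    # Paper's convention: sign(0) = 1.
    return np.where(z >= 0, 1, -1)

def p_star(x):
    # Conditional probability of DGP1 as listed in Table 1.
    return logistic(-0.5 * x + 0.2 * x**3)

def expected_utility_mc(rule, n_draws=100_000, seed=0, b=20.0, c=0.5):
    """Monte Carlo approximation of S(rule) using out-of-sample draws."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_draws)                          # assumed X ~ N(0, 1)
    y = np.where(rng.random(n_draws) < p_star(x), 1, -1)  # Y | X ~ p*(X)
    return float(np.mean(b * (y + 1 - 2 * c) * sign(rule(x) - c)))

s_oracle = expected_utility_mc(p_star)                       # uses the true p*
s_always_one = expected_utility_mc(lambda x: np.ones_like(x))  # always predicts 1
print(s_oracle > s_always_one)  # True
```

For an estimated rule, the same function would be called with the fitted $\hat{f}$ held fixed, which is what conditioning on $\mathscr{D}_{n}$ amounts to.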

The structural risk minimization approach consists of the following steps. First, we consider a nondecreasing sieve $\{\mathcal{F}_{k}\}_{k=1}^{\infty}$, that is, a hierarchy of classes of functions

$\displaystyle\mathcal{F}_{1}\subseteq\mathcal{F}_{2}\subseteq\cdots\subseteq\mathcal{F}_{k}\subseteq\cdots.$

Footnote 7: In the literature on sieve estimation, a metric space $(\mathcal{F}^{*},\rho)$ is usually pre-specified such that $\mathcal{F}\equiv\bigcup_{k=1}^{\infty}\mathcal{F}_{k}$ is dense in $\mathcal{F}^{*}$ with respect to $\rho$. See, for example, Geman and Hwang (1982). In this paper, however, denseness is not assumed, and $\mathcal{F}^{*}$ can be treated as the collection of all measurable real-valued functions.

For example, $\mathcal{F}_{k}=\mathcal{P}_{k}$ is the class of polynomial transformations on $\mathcal{X}$ of order at most $k$, or it can be further transformed by the logistic function as $\mathcal{F}_{k}=\Lambda(\mathcal{P}_{k})\equiv\left\{x\mapsto\Lambda(f(x)):f\in\mathcal{P}_{k}\right\}$, where $\Lambda(v)=(1+\exp{\{-v\}})^{-1}$ for all $v\in\mathbb{R}$. We refer the reader to Chen (2007) for more examples of sieves.

Footnote 8: A polynomial transformation on $\mathcal{X}\subseteq\mathbb{R}^{d}$ of degree at most $k$ is a function of the form $f(x)=c_{0}+\sum_{j=1}^{q}c_{j}\varrho_{j}(x)$, where $(c_{0},c_{1},\ldots,c_{q})\in\mathbb{R}^{(q+1)}$ and $\varrho_{j}(x)=\prod_{\ell=1}^{d}x_{\ell}^{p_{j\ell}}$ with $\sum_{\ell=1}^{d}p_{j\ell}\leq k$ and $p_{j\ell}\in\mathbb{N}\cup\{0\}$ for each $j$ and $q\in\mathbb{N}$.

For each $\mathcal{F}_{k}$, we select a maximum utility estimator

$\displaystyle\hat{f}_{k}\in\arg\max_{f\in\mathcal{F}_{k}}S_{n}(f).\qquad(3)$
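The polynomial sieve $\mathcal{P}_{k}$ and its logistic transformation $\Lambda(\mathcal{P}_{k})$ can be sketched in Python as follows. The helper names are hypothetical, and the enumeration excludes the zero exponent vector because the intercept $c_{0}$ already covers the constant term; this is an illustrative reading of the definition, not code from the paper.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple


def monomial_exponents(d: int, k: int) -> List[Tuple[int, ...]]:
    """Exponent vectors (p_1, ..., p_d) of non-negative integers with 1 <= sum <= k."""
    return [p for p in product(range(k + 1), repeat=d) if 1 <= sum(p) <= k]


def poly_value(x: Sequence[float], c0: float, coefs: Sequence[float],
               exps: Sequence[Tuple[int, ...]]) -> float:
    """f(x) = c0 + sum_j c_j * prod_l x_l ** p_jl."""
    return c0 + sum(
        cj * math.prod(xl ** pl for xl, pl in zip(x, p))
        for cj, p in zip(coefs, exps)
    )


def logistic(v: float) -> float:
    """Lambda(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))


# Number of free coefficients (intercept included) for d = 2 covariates, order k = 3.
num_coefficients = len(monomial_exponents(2, 3)) + 1
```

An element of $\Lambda(\mathcal{P}_{k})$ is then `logistic(poly_value(x, c0, coefs, exps))` for some choice of coefficients.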

In addition, we construct a complexity penalty $C_{n}(k;\alpha)$ for $\mathcal{F}_{k}$, where $\alpha>0$ is a tuning parameter introduced for a technical reason. This technical reason and the choice of $\alpha$ are discussed later. Let

$\displaystyle\tilde{S}_{n}(f;k,\alpha)=S_{n}(f)-C_{n}(k;\alpha)$

be the associated complexity-penalized empirical utility of a prediction rule $f\in\mathcal{F}_{k}$. Finally, we define a utility-maximizing prediction rule (UMPR) as a maximum utility estimator $\hat{f}_{k}$ in (3) that maximizes $\tilde{S}_{n}(\hat{f}_{k};k,\alpha)$; that is,

$\displaystyle\tilde{f}_{n}(\cdot\;;\alpha)\equiv\hat{f}_{\hat{k}_{n}(\alpha)},\quad\text{where}\quad\hat{k}_{n}(\alpha)\in\arg\max_{k\in\mathbb{N}}\tilde{S}_{n}(\hat{f}_{k};k,\alpha).$

Footnote 9: Precisely, the UMPR should refer to $\text{sign}(\tilde{f}_{n}(\cdot\;;\alpha)-c(\cdot))$. Since the meaning will be clear from the context, we reserve the term UMPR for the estimator $\tilde{f}_{n}(\cdot\;;\alpha)$.

For ease of presentation, we suppress the dependence of $(\tilde{S}_{n}(f;k,\alpha),\hat{k}_{n}(\alpha),\tilde{f}_{n}(\cdot\;;\alpha))$ on $\alpha$ and write $\tilde{S}_{n}(\tilde{f}_{n})=\tilde{S}_{n}(\hat{f}_{\hat{k}_{n}};\hat{k}_{n})$ for the complexity-penalized empirical utility of the UMPR.
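The selection step can be sketched as follows, assuming that the empirical utilities $S_{n}(\hat{f}_{k})$ and the penalties $C_{n}(k;\alpha)$ have already been computed for a finite list of candidate classes $k=1,\ldots,K$. The truncation to finitely many classes, the tie-breaking rule (smallest $k$), and the function name are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def select_umpr(empirical_utilities: Sequence[float],
                penalties: Sequence[float]) -> Tuple[int, float]:
    """Return (k_hat, penalized utility), with k_hat counted from 1.

    The penalized utility of class k is S_n(f_hat_k) - C_n(k); ties are
    broken in favor of the smallest k (an assumed convention).
    """
    penalized = [s_k - c_k for s_k, c_k in zip(empirical_utilities, penalties)]
    best = max(range(len(penalized)), key=penalized.__getitem__)
    return best + 1, penalized[best]


# Hypothetical numbers: utility rises with k, but so does the penalty.
k_hat, value = select_umpr([0.50, 0.58, 0.60], [0.05, 0.08, 0.15])
```

In this made-up example the richest class has the largest empirical utility, yet the intermediate class is selected once the penalty is subtracted.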

The idea of complexity penalization has been used in the selection methods based on information criteria in econometrics. These methods aim to maximize the complexity-penalized empirical log-likelihood evaluated at the maximum likelihood estimator. Specifically, let $\mathcal{L}$ be the log-likelihood function of a single observation $(Y,X)$ and $\hat{f}^{\text{ML}}_{k}$ be the maximum likelihood estimator in $\mathcal{F}_{k}$ . Given a complexity measure $C^{\text{IC}}_{n}(k)$ for $\mathcal{F}_{k}$ , we can construct an estimator

[TABLE]

Leading examples include the AIC by setting $C^{\text{IC}}_{n}(k)$ to be the number of free parameters in $\mathcal{F}_{k}$ divided by $n$ , and the BIC by setting $C^{\text{IC}}_{n}(k)$ to be the number of free parameters in $\mathcal{F}_{k}$ multiplied by $\log{\{n\}}/(2n)$ . Although the AIC and BIC only differ in the choice of complexity penalty in the selection procedure, their asymptotic behaviors are different. Details on these differences can be found in Konishi and Kitagawa (2008), Claeskens and Hjort (2008), and references given there.

The UMPR shares a similar motivation with the AIC. Just as the AIC adjusts the empirical log-likelihood to approximate the expected log-likelihood, the UMPR adjusts the empirical utility to approximate the expected utility. Both adjustments are fulfilled by subtracting specific complexity penalties. However, the AIC attempts to recover $p^{*}$ , while the UMPR aims to select a model in which the decision maker only focuses on the local fitting at the crossing points between $p^{*}$ and $c$ .

As with the penalized likelihood criteria, the choice of the complexity penalty $C_{n}(k;\alpha)$ is essential in the proposed penalized empirical utility criterion. First, the complexity penalty should be constructed without assuming any knowledge of the conditional probability $p^{*}$ so that the UMPR, like the maximum utility estimator, can perform well when $p^{*}$ is misspecified. More importantly, the complexity penalty should be an appropriate estimate of the magnitude of overfitting $S_{n}(\hat{f}_{k})-S(\hat{f}_{k})$. In this case, the expected utility $S(\hat{f}_{k})$ can be recovered by the penalized empirical utility $\tilde{S}_{n}(\hat{f}_{k};k)$, and thus the UMPR $\tilde{f}_{n}$ will have the largest expected utility $S(\tilde{f}_{n})$ among the maximum utility estimators $\{\hat{f}_{k}\}_{k=1}^{\infty}$. Taking these two requirements into account, we construct complexity penalties, without assuming knowledge of $p^{*}$, that non-asymptotically bound the in-sample overfitting. This non-asymptotic complexity-regularized approach is in marked contrast to the information-theoretic approach, for example the AIC and GIC, where penalties are obtained by an asymptotic approximation of the Kullback-Leibler divergence but may not effectively control the in-sample overfitting. As will be shown later, the non-asymptotic complexity penalties may differ from the information-theoretic complexity penalties in their order with respect to the sample size. We discuss both distribution-free and data-dependent penalty terms constructed by the non-asymptotic complexity-regularized approach, and study the theoretical properties of their associated utility-maximizing prediction rules in the following subsections.

3.1 UMPR with a Distribution-Free Penalty

Since the seminal work by Vapnik and Chervonenkis (1971), there have been many improvements in the VC-type upper bound on the uniform deviation of empirical means from their expectations. Lugosi and Zeger (1996) further applied the VC-type upper bound to finding a complexity penalty in the traditional binary classification. Motivated by this idea, we aim to construct a VC complexity penalty for the maximum utility estimation.

We start by making the following assumptions.

Assumptions

(A1) The conditional probability $p^{*}(x)\equiv\mathbb{P}(Y=1\mid X=x)$ does not depend on the decision $a$.

(A2) For all $x$ in the support $\mathcal{X}\subseteq\mathbb{R}^{d}$ of $X$, $u_{1,1}(x)>u_{-1,1}(x)$ and $u_{-1,-1}(x)>u_{1,-1}(x)$.

(A3) For any $a,y\in\{1,-1\}$, $u_{a,y}(\cdot)$ is Borel measurable; in addition, there is some $M>0$ such that $|u_{a,y}(x)|\leq M$ for all $x\in\mathcal{X}$ and $a,y\in\{1,-1\}$.

(A4) For each $k\in\mathbb{N}$, the class $\mathcal{F}_{k}$ of functions is countable.

The first three assumptions are imposed in Elliott and Lieli (2013), and the last assumption is imposed to avoid measurability complications. Assumption (A1) excludes the possibility of feedback from the binary action to the binary outcome. Take the financial investment in Section 1 as an example. Under Assumption (A1), investors are price takers whose decisions on buying an instrument do not affect the possibility of price change. Assumption (A2) implies that the decision maker obtains higher utility when the decision matches the outcome; in particular, there should be a best response $a$ to a realized outcome $Y$ if this realization were observed by the decision maker before making a decision. This assumption seems plausible in many situations, for example the aforementioned financial investment. The uniform boundedness imposed by Assumption (A3) implies some shape constraint on the utility functions, especially when the support $\mathcal{X}$ is unbounded. However, this assumption could be compatible with some models of the financial investment. An example is the exponential utility (also known as constant absolute risk aversion preference) used by Christensen et al. (2012) in an asset pricing model when the decision maker is risk averse. Assumption (A4) is inconsequential in practice because there are only countably many computable real numbers evaluated for the parameters of $\mathcal{F}_{k}$ by a computer program. The use of computable real numbers and functions could also be interpreted as a decision maker’s computability-bounded rationality, as in Richter and Wong (1999).

These assumptions allow us to establish a VC-type upper bound on the uniform deviation of $S_{n}(f)$ from $S(f)$ .

Proposition 1.

Suppose that i.i.d. data $\mathscr{D}_{n}=\{(Y_{i},X_{i})\}_{i=1}^{n}$ are available. Under Assumptions (A1)-(A4), we have for any $n,k\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

where $\Pi_{k,c}(\cdot)$ is the growth function of $\mathcal{F}_{k,c}\equiv\left\{x\mapsto\text{sign}(f(x)-c(x)):f\in\mathcal{F}_{k}\right\}$.

Footnote 10: For any collection $\mathcal{H}$ of functions from $\mathcal{X}$ to $\{-1,1\}$, the growth function $\Pi_{\mathcal{H}}:\mathbb{N}\to\mathbb{N}$ of $\mathcal{H}$ is

$\displaystyle\Pi_{\mathcal{H}}(\ell)=\max_{(x_{1},\ldots,x_{\ell})\in\mathcal{X}^{\ell}}\left|\{(h(x_{1}),\ldots,h(x_{\ell})):h\in\mathcal{H}\}\right|.$

That is, the growth function $\Pi_{\mathcal{H}}(\ell)$ is the maximum number of distinct ways in which $\ell$ points $(x_{1},\ldots,x_{\ell})$ can be classified using functions in $\mathcal{H}$.
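To make the growth function concrete, the sketch below counts the distinct labelings that a small class of classifiers induces on a fixed set of points. The class used here, univariate threshold rules $x\mapsto\text{sign}(x-t)$, is an assumption chosen for illustration and is not one of the classes studied in the paper. For this class every set of $\ell$ distinct points yields $\ell+1$ labelings, so the count on one configuration equals the growth function.

```python
from typing import Callable, List, Sequence, Set, Tuple

Classifier = Callable[[float], int]


def labelings(points: Sequence[float],
              classifiers: Sequence[Classifier]) -> Set[Tuple[int, ...]]:
    """Distinct label vectors (h(x_1), ..., h(x_l)) over the given classifiers."""
    return {tuple(h(x) for x in points) for h in classifiers}


def threshold_rules(points: Sequence[float]) -> List[Classifier]:
    """One rule x -> sign(x - t) per gap between sorted points, plus both ends.

    These thresholds are enough to realize every labeling that any
    threshold rule can produce on the given points.
    """
    xs = sorted(set(points))
    cuts = [xs[0] - 1.0]
    cuts += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    cuts += [xs[-1] + 1.0]
    return [lambda x, t=t: 1 if x > t else -1 for t in cuts]


pts = [0.1, 0.4, 0.7]
count = len(labelings(pts, threshold_rules(pts)))  # compare with 2 ** len(pts)
```

Since the count is strictly below $2^{\ell}$, some label vectors on these points cannot be reproduced by any rule in the class.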

The maximal inequality in Proposition 1 suggests that large empirical utility arising from sophisticated models does not guarantee large expected utility. To see this, note that for any $n,k\in\mathbb{N}$ and $\delta\in(0,1)$ , with probability at least $1-\delta$ ,

[TABLE]

Thus, given the sample size $n$, an increase in $k$ tends to increase the empirical utility $S_{n}$, but it may also increase $\Pi_{k,c}(n)$. The growth function $\Pi_{k,c}$ measures the complexity of $\mathcal{F}_{k,c}$ in fitting in-sample observations. Clearly, $\Pi_{k,c}(\ell)\leq 2^{\ell}$ for each $\ell\in\mathbb{N}$. If $\Pi_{k,c}(\ell)<2^{\ell}$ for some $\ell$, the complexity of $\mathcal{F}_{k,c}$ is restricted because some specific $\ell$ observations $\{(Y_{i},X_{i})\}_{i=1}^{\ell}$ cannot be separated by $\mathcal{F}_{k,c}$ without any classification error.

Footnote 11: There are $2^{\ell}-\Pi_{k,c}(\ell)$ possible realizations of $(Y_{1},\cdots,Y_{\ell})^{\top}$ such that for all $(x_{1},\cdots,x_{\ell})^{\top}$ and $f\in\mathcal{F}_{k}$,

$\displaystyle(Y_{1},\cdots,Y_{\ell})^{\top}\neq(\text{sign}(f(x_{1})-c(x_{1})),\cdots,\text{sign}(f(x_{\ell})-c(x_{\ell})))^{\top}.$

Proposition 1 also implies the asymptotic behavior of $\sup_{f\in\mathcal{F}_{k}}|S_{n}(f)-S(f)|$ as follows.

Corollary 1.

Suppose that the growth function $\Pi_{k,c}$ is of polynomial order for each $k\in\mathbb{N}$ . If the assumptions of Proposition 1 hold, then for any $k\in\mathbb{N}$ and $\varepsilon>0$ , there exists an integer $n^{*}$ such that

[TABLE]

for all $n\geq n^{*}$ .

This corollary immediately guarantees that, given i.i.d. observations, $|S_{n}(\hat{f}_{k})-S(\hat{f}_{k})|$ converges almost surely to zero whenever $\Pi_{k,c}$ is of polynomial order. Hence, the technical conditions imposed by Proposition 2 of Elliott and Lieli (2013), such as compactness of the parameter space and Lipschitz continuity of functions with respect to the parameter, can be relaxed.

More importantly, the maximal inequality in Proposition 1 is non-asymptotic; thus, it can be used to estimate the upper bound on $S_{n}(\hat{f}_{k})-S(\hat{f}_{k})$ for every finite sample size $n$ when the growth function $\Pi_{k,c}$ is known. The calculation of $\Pi_{k,c}(n)$ is, however, not easy in practice. Thus, $\Pi_{k,c}(n)$ is usually replaced with an upper bound

[TABLE]

where $V_{k,c}$ is the VC dimension of the class $\mathcal{F}_{k,c}$ , which is the largest integer $\ell$ such that $\Pi_{k,c}(\ell)=2^{\ell}$ by definition. By Proposition 1, the replacement of $\Pi_{k,c}(n)$ with $\psi_{c}(k,n)$ shows that for any $n,k\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

The upper bound $\psi_{c}(k,n)$ follows from a combinatorial result known as Sauer’s lemma; see Theorems 3.6 and 3.7 of Anthony and Bartlett (1999). As a parameter of $\psi_{c}(k,n)$, the VC dimension $V_{k,c}$, like the growth function, restricts the complexity of $\mathcal{F}_{k,c}$. Note that $\Pi_{k,c}(V_{k,c}+1)<2^{(V_{k,c}+1)}$ if $V_{k,c}<\infty$. Thus, whenever the sample size $n$ is greater than $V_{k,c}$, there is some realization of $\{(Y_{i},X_{i})\}_{i=1}^{n}$ that $\mathcal{F}_{k,c}$ cannot classify without error. As shown in Theorem 3.5 of Anthony and Bartlett (1999), the VC dimension $V_{k,c}$ is equal to the dimension of $\mathcal{F}_{k}$ if this class $\mathcal{F}_{k}$ is specified as a vector space of real-valued functions. For example, the VC dimension $V_{k,c}$ is ${d+k\choose k}$ if $\mathcal{F}_{k}$ is the class $\mathcal{P}_{k}$ of polynomial transformations on $\mathcal{X}$ of order at most $k$ in the absence of dummy covariates. Even under the logit specification, say $\mathcal{F}_{k,c}=\left\{x\mapsto\text{sign}(f(x)-c(x)):f\in\Lambda(\mathcal{P}_{k})\right\}$, the VC dimension $V_{k,c}$ can be bounded by ${d+k\choose k}+1$.

Footnote 12: To see this, note that for a class of Boolean functions, the VC dimension is equal to the VC index minus one by definition. Since both functions $\text{sign}(\cdot)$ and $\Lambda(\cdot)$ are monotone, the VC index of $\mathcal{F}_{k,c}$ is less than or equal to that of $\mathcal{P}_{k}$ by Lemma 9.9 (v) and (viii) of Kosorok (2008). Applying Lemma 9.6 of Kosorok (2008) yields the result.
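The two quantities discussed above are easy to compute. The sketch below evaluates the VC dimension ${d+k\choose k}$ of the polynomial class and the standard form of Sauer's lemma, $\Pi(n)\leq\sum_{i=0}^{V}{n\choose i}$. The exact expression the paper uses for $\psi_{c}(k,n)$ is not reproduced here, so `sauer_bound` should be read as the textbook bound rather than as the paper's definition; the function names are hypothetical.

```python
import math


def vc_dim_polynomial(d: int, k: int) -> int:
    """VC dimension (d + k choose k) for polynomials of order at most k in d covariates."""
    return math.comb(d + k, k)


def sauer_bound(n: int, vc_dim: int) -> int:
    """Sauer's lemma: the growth function at n is at most sum_{i=0}^{V} C(n, i)."""
    return sum(math.comb(n, i) for i in range(min(n, vc_dim) + 1))


# With n = 10 points and VC dimension 2, far fewer than 2 ** 10 labelings are possible.
bound = sauer_bound(10, 2)
```

When $n\leq V$ the sum equals $2^{n}$, which matches the fact that a set of at most $V$ points can be shattered.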

More generally, if $\mathcal{F}_{k}$ is a VC-subgraph class, then the VC dimension $V_{k,c}$ equals the VC index of $\mathcal{F}_{k}$ minus one by Lemma 9.9 of Kosorok (2008). We refer the reader to Section 2.6 of van der Vaart and Wellner (1996) and Section 9.1 of Kosorok (2008) for properties of a VC-subgraph class.

Footnote 13: Let $\mathcal{C}$ be a collection of subsets of $\mathcal{Z}$. The collection $\mathcal{C}$ is said to shatter a subset $\mathcal{Z}_{\ell}=\{z_{1},\ldots,z_{\ell}\}\subseteq\mathcal{Z}$ if the cardinality of $\left\{\{\mathcal{Z}_{\ell}\cap C\}:C\in\mathcal{C}\right\}$ is equal to $2^{\ell}$. The collection $\mathcal{C}$ is called a Vapnik-Chervonenkis (VC) class if for some $\ell\in\mathbb{N}$, no subset of cardinality $\ell$ is shattered by $\mathcal{C}$. A collection $\mathcal{F}$ is a VC-subgraph class if the collection $\big{\{}\{(x,t)\in\mathcal{X}\times\mathbb{R}:t<f(x)\}:f\in\mathcal{F}\big{\}}$ of all subgraphs is a VC class of sets in $\mathcal{X}\times\mathbb{R}$.

The easily computable VC-type upper bound permits the construction of a distribution-free complexity penalty. For each $k$ , we consider an estimate of expected utility $S(\hat{f}_{k})$ to be

[TABLE]

It follows that for each $k$ , we obtain a non-asymptotic upper bound on the tail probability for $R^{\text{VC}}_{n,k}-S(\hat{f}_{k})$ . To be specific, we have for any $n,k\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

Following the suggestion in Bartlett et al. (2002), we consider the VC complexity penalty

$\displaystyle C^{\text{VC}}_{n}(k;\alpha)=S_{n}(\hat{f}_{k})-R^{\text{VC}}_{n,k}+8M\chi_{n}(k;\alpha),$

where

$\displaystyle\chi_{n}(k;\alpha)=\sqrt{\frac{(1+\alpha)\log\{V_{k,c}\}}{2n}}.$

The VC complexity penalty is the sum of the estimate $S_{n}(\hat{f}_{k})-R^{\text{VC}}_{n,k}$ of the magnitude of overfitting and a technical term $8M\chi_{n}(k;\alpha)$. By treating $S_{n}(\hat{f}_{k})-R^{\text{VC}}_{n,k}$ as a component of the VC complexity penalty, the non-asymptotic complexity-regularized approach explicitly accounts for the in-sample overfitting. The VC complexity penalty differs from the penalties of the information-theoretic approach in the order of $n$. For example, the VC complexity penalty (without the technical term) is $8M\sqrt{2(k+1)\log\{en/(k+1)\}/n}$ if we consider the specification $\mathcal{F}_{k}=\mathcal{P}_{k}$ of univariate polynomial functions for the UMPR. Given the logistic transformation of the same specification $\mathcal{F}_{k}=\mathcal{P}_{k}$ in the empirical log-likelihood, the AIC and BIC have the penalties $(k+1)/n$ and $(k+1)\log\{n\}/(2n)$, respectively. In this example, the VC complexity penalty is greater than the penalties used in the AIC and BIC when a large sample is available. The different convergence rates can be attributed to the difference in the underlying objective function: the AIC and BIC are both associated with the empirical log-likelihood function, whereas the UMPR is associated with the empirical utility function.

Footnote 14: The derivation of penalties for the AIC and BIC relies on twice differentiability of the empirical log-likelihood function. See, for example, Sections 3.4 and 9.1 of Konishi and Kitagawa (2008). The empirical utility function is, however, not differentiable. Despite the non-differentiability, a minimax lower bound on $\operatorname*{\mathbb{E}}[S_{n}(\hat{f}_{k})]-\operatorname*{\mathbb{E}}[S(\hat{f}_{k})]$ could still be established. Let $\mathcal{P}(\mathcal{F}_{k})$ be the set of all distributions of $(Y,X)$ such that $p^{*}\in\mathcal{F}_{k}$. In the special case that $b(x)=b>0$ and $c(x)=1/2$ for all $x\in\mathcal{X}$, if $\mathcal{F}_{k}$ has a finite VC index greater than 2, then Inequality (38) of Massart and Nédélec (2006) implies $\sup_{\mathbb{P}\in\mathcal{P}(\mathcal{F}_{k})}\left\{\operatorname*{\mathbb{E}}[S_{n}(\hat{f}_{k})]-\operatorname*{\mathbb{E}}[S(\hat{f}_{k})]\right\}\geq\operatorname{\Omega}\left(1/\sqrt{n}\right)$, where the notation $\operatorname{\Omega}\left(1/\sqrt{n}\right)$ indicates that there exist positive constants $\kappa_{0}$ and $n_{0}$ such that this lower bound is greater than $\kappa_{0}/\sqrt{n}$ for all $n\geq n_{0}$. Thus, the VC complexity penalty is near optimal in the minimax sense. This lower bound can be improved under a margin restriction on $p^{*}$. Details about the margin restriction can be found in Massart and Nédélec (2006).
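The difference in the order of $n$ can be checked numerically. The sketch below evaluates the three penalty expressions quoted in the text for univariate polynomials of order $k$; the bound $M=1$ and the sample sizes are arbitrary assumptions chosen for illustration.

```python
import math


def vc_penalty_univariate(k: int, n: int, M: float) -> float:
    """8 M sqrt(2 (k + 1) log(e n / (k + 1)) / n), without the technical term."""
    V = k + 1
    return 8.0 * M * math.sqrt(2.0 * V * math.log(math.e * n / V) / n)


def aic_penalty(k: int, n: int) -> float:
    """(k + 1) / n."""
    return (k + 1) / n


def bic_penalty(k: int, n: int) -> float:
    """(k + 1) log(n) / (2 n)."""
    return (k + 1) * math.log(n) / (2.0 * n)


for n in (500, 5000, 50000):
    print(n, vc_penalty_univariate(2, n, 1.0), aic_penalty(2, n), bic_penalty(2, n))
```

The VC expression shrinks at the rate $\sqrt{\log\{n\}/n}$, whereas the AIC and BIC penalties shrink at the rates $1/n$ and $\log\{n\}/n$, so the gap between them widens in relative terms as $n$ grows.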

The technical term $\chi_{n}(k;\alpha)$ is included in the penalty so that the series $\zeta(\alpha)\equiv\sum_{k=1}^{\infty}V_{k,c}^{-(1+\alpha)}$ is finite at some $\alpha=\alpha_{0}$, which ensures that the union bound holds nontrivially in the proof of the following theorem.

Theorem 1.

Suppose that (i) the data $\mathscr{D}_{n}=\{(Y_{i},X_{i})\}_{i=1}^{n}$ are i.i.d., (ii) Assumptions (A1)-(A4) hold, (iii) $\mathcal{F}_{k}$ is a VC-subgraph class for each $k$ , and (iv) $\zeta(\alpha_{0})<\infty$ for some $\alpha_{0}$ . If the UMPR $\tilde{f}_{n}$ is constructed based on the penalty $C^{\text{VC}}_{n}$ with tuning parameter $\alpha_{0}$ , then for any $n\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

and

[TABLE]

where $S^{*}\equiv\sup_{f}S(f)$ and $S_{k}^{*}\equiv\sup_{f\in\mathcal{F}_{k}}S(f)$ for each $k$ .

Theorem 1 implies a probabilistic lower bound on the expected utility $S(\tilde{f}_{n})$ ; that is, for any $n\in\mathbb{N}$ and $\delta\in(0,1)$ ,

[TABLE]

with probability at least $1-\delta$ . Theorem 1 also shows an upper bound on the difference between the maximal expected utility $S^{*}$ and the generalized expected utility $\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ . This upper bound takes into account the trade-off between the complexity penalty $C^{\text{VC}}_{n}(k;\alpha_{0})$ and the approximation error $S^{*}-S^{*}_{k}$ . Furthermore, if the approximation error is equal to zero for some $k$ , then $\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ converges to $S^{*}$ because the upper bound on this difference shrinks to zero as the sample size tends to infinity. In this case, the convergence of $\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ to $S^{*}$ is equivalent to the convergence of $S(\tilde{f}_{n})$ to $S^{*}$ in probability because $\sup_{f\in\mathcal{F}}|S(f)|\leq 4M$ under Assumption (A3). In fact, we can establish the almost sure convergence of $S(\tilde{f}_{n})$ as follows.

Corollary 2.

Suppose that the assumptions of Theorem 1 hold. The UMPR $\tilde{f}_{n}$ constructed based on the penalty $C^{\text{VC}}_{n}$ with tuning parameter $\alpha_{0}$ satisfies

[TABLE]

for any distribution of $(Y,X)$ such that $\lim_{k\to\infty}S_{k}^{*}=S^{*}$ .

Corollary 2 shows that the UMPR $\tilde{f}_{n}$ with the VC penalty is universally utility consistent because the almost sure convergence holds for every distribution of $(Y,X)$ satisfying $\lim_{k\to\infty}S_{k}^{*}=S^{*}$ . To check this convergence of approximation error for some function classes $\{\mathcal{F}_{k}\}_{k=1}^{\infty}$ , we prove the following proposition.

Proposition 2.

Suppose that Assumptions (A1) and (A2) hold. For any (measurable) deterministic function $f:\mathcal{X}\to\mathbb{R}$, we have

[TABLE]

and the maximal expected utility $S^{*}$ satisfies

[TABLE]

If, in addition, Assumption (A3) holds, then

[TABLE]

for any deterministic function $f$ .

This proposition implies that for each $k\in\mathbb{N}$ ,

[TABLE]

If we specify $\mathcal{F}_{k}$ as the class of polynomial transformations on $\mathcal{X}$ of order at most $k$, then the Stone-Weierstrass approximation theorem ensures that $\inf_{f\in\mathcal{F}_{k}}\sup_{x\in\mathcal{X}}|f(x)-p^{*}(x)|$ converges to zero as $k$ tends to infinity whenever $p^{*}$ is continuous on the support $\mathcal{X}$ and $\mathcal{X}$ is a compact subset of $\mathbb{R}^{d}$. Moreover, if each $r$-th order partial derivative of $p^{*}$ exists and is continuous on $\mathcal{X}$ for all $r\leq s\in\mathbb{N}$, and $\mathcal{X}$ is compact, then the multivariate Jackson theorem of Bagby et al. (2002) implies that $\inf_{f\in\mathcal{F}_{k}}\sup_{x\in\mathcal{X}}|f(x)-p^{*}(x)|=\operatorname{O}\left(k^{-s}\right)$. Rather than evaluating the global approximation to $p^{*}$, Elliott and Lieli (2013) illustrate some preferences and data generating processes of $(Y,X)$ in which finite-order polynomial functions in $X$ can completely replicate the crossing points between $p^{*}(x)$ and $c(x)$; more precisely, there is some polynomial function $f_{0}$ of sufficiently high order such that $\text{sign}(f_{0}(x)-c(x))=\text{sign}(p^{*}(x)-c(x))$ and thus $S^{*}_{k}\equiv\sup_{f\in\mathcal{P}_{k}}S(f)=S^{*}$ for some $k\in\mathbb{N}$ by Proposition 2.

By explicitly expressing $S^{*}-S(f)$ for any nonrandom function $f$, Proposition 2 also confirms Elliott and Lieli’s insight that the correct specification of $\text{sign}(p^{*}(x)-c(x))$ is sufficient to achieve the maximal expected utility. Furthermore, Proposition 2 extends the properties of the Bayes decision rule to the cost-sensitive case. In this case, the maximal expected utility $S^{*}$ depends not only on the distribution of $(Y,X)$ via the conditional probability $p^{*}$ but also on the decision maker’s preference via the weight function $b$ and the cutoff function $c$. Corresponding results in traditional binary classification can be found in Sections 2.4 and 2.5 of Devroye et al. (1996).

Remark 1.

The second part of the complexity penalty $C^{\text{VC}}_{n}(k;\alpha_{0})$ involves a technical term $((1+\alpha_{0})\log{\{V_{k,c}\}}/(2n))^{1/2}$. Instead of using $(\log{\{k\}}/n)^{1/2}$ as in Bartlett et al. (2002), we replace $k$ with the VC dimension $V_{k,c}$ of $\mathcal{F}_{k,c}$. For example, when $\mathcal{F}_{k}$ is the class of univariate polynomial functions of order at most $k$, $V_{k,c}=k+1$. We also replace the constant $1$ with $(1+\alpha_{0})/2$ so that $\zeta(\alpha_{0})=\sum_{k=1}^{\infty}V_{k,c}^{-(1+\alpha_{0})}$ is finite. This condition may hold for different values of $\alpha$. The selection of $\alpha_{0}$ by cross-validation is discussed in Appendix A.

In practice, researchers may expect that only certain classes of functions are worth consideration. For example, domain knowledge could suggest that higher-order interactions should be of limited importance, as argued in Athey and Imbens (2019). In this case, the UMPR is selected from a few classes of functions, and selection of $\alpha_{0}$ is not an issue because the technical term $8M\chi_{n}(k;\alpha)$ can be removed from the complexity penalty.
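The role of $\alpha_{0}$ can be illustrated numerically. The sketch below evaluates the technical term quoted in this remark and partial sums of $\zeta(\alpha)$ for univariate polynomials, where $V_{k,c}=k+1$. It assumes that $\chi_{n}(k;\alpha)$ coincides with the term $((1+\alpha)\log{\{V_{k,c}\}}/(2n))^{1/2}$; the function names and the truncation point are hypothetical.

```python
import math


def technical_term(vc_dim: int, n: int, alpha: float) -> float:
    """((1 + alpha) log(V) / (2 n)) ** 0.5, the term quoted in Remark 1."""
    return math.sqrt((1.0 + alpha) * math.log(vc_dim) / (2.0 * n))


def zeta_partial(alpha: float, K: int) -> float:
    """Partial sum of zeta(alpha) = sum_k V_k ** -(1 + alpha) with V_k = k + 1."""
    return sum((k + 1) ** (-(1.0 + alpha)) for k in range(1, K + 1))


# For alpha = 1 the full series equals pi ** 2 / 6 - 1, so partial sums stay below it.
partial = zeta_partial(1.0, 1000)
limit = math.pi ** 2 / 6.0 - 1.0
```

A larger $\alpha$ makes the series converge faster but inflates the technical term, which is one way to see why the choice of $\alpha_{0}$ involves a trade-off.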

Remark 2.

The proposed complexity-regularized approach can also be applied to variable selection problems, which recently have attracted much attention in the econometrics literature. To see this application, let $X=(X_{1},\dots,X_{d})^{\top}$ be a $d$ -dimensional vector of covariates and $\mathcal{F}_{\mathscr{V}}$ be the class of linear functions of covariates in a nonempty set $\mathscr{V}\subseteq\{X_{1},\dots,X_{d}\}$ with cardinality $|\mathscr{V}|$ . Instead of specifying nested models, we consider nonnested models in variable selection problems. For example, $\mathcal{F}_{\{X_{2}\}}=\{X_{2}\mapsto\beta_{0}+\beta_{1}X_{2}:(\beta_{0},\beta_{1})^{\top}\in\mathbb{R}^{2}\}$ , $\mathcal{F}_{\{X_{1},X_{3}\}}=\{(X_{1},X_{3})\mapsto\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{3}:(\beta_{0},\beta_{1},\beta_{2})^{\top}\in\mathbb{R}^{3}\}$ , and $\mathcal{F}_{\{X_{2}\}}$ is neither a subset nor a superset of $\mathcal{F}_{\{X_{1},X_{3}\}}$ . The VC complexity penalty of $\mathcal{F}_{\mathscr{V}}$ is

[TABLE]

The nonempty subset $\hat{\mathscr{V}}_{n}$ of $\{X_{1},\dots,X_{d}\}$ is selected if it has the largest associated complexity-penalized empirical utility among all classes in

[TABLE]

Specifically,

[TABLE]

where $\hat{f}_{\mathscr{V}}\in\arg\max_{f\in\mathcal{F}_{\mathscr{V}}}S_{n}(f)$ is a maximum utility estimator. The UMPR is defined as $\tilde{f}_{n}\equiv\hat{f}_{\hat{\mathscr{V}}_{n}}$ .

A non-asymptotic upper bound on the difference between the maximal expected utility $S^{*}$ and the generalized expected utility $\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ can still be established. Suppose that Assumptions (A1)-(A4) hold. It can be shown that for any $n\in\mathbb{N}$ ,

[TABLE]

where $S_{\mathscr{V}}^{*}\equiv\sup_{f\in\mathcal{F}_{\mathscr{V}}}S(f)$ for every $\mathscr{V}$ . The derivation details are omitted, as they are similar to the arguments in the proof of Theorem 1. Thus, if the approximation error $S^{*}-S_{\mathscr{V}}^{*}$ is equal to zero for some $\mathscr{V}$ and $d=\operatorname{o}\left(n/\log\{n\}\right)$ , then $S(\tilde{f}_{n})$ converges in mean and in probability to $S^{*}$ .
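The candidate classes in this variable selection problem can be enumerated as below. The covariate names are placeholders, and the dimension $|\mathscr{V}|+1$ (slopes plus intercept) is reported as the complexity index on the assumption that the vector-space result quoted earlier applies to each linear class; the penalty itself is not reproduced.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence


def candidate_subsets(names: Sequence[str]) -> List[FrozenSet[str]]:
    """All nonempty subsets of the covariates, ordered by cardinality."""
    return [
        frozenset(combo)
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def linear_class_dimension(subset: FrozenSet[str]) -> int:
    """Dimension of the linear class on the subset: one slope per covariate plus an intercept."""
    return len(subset) + 1


subsets = candidate_subsets(["X1", "X2", "X3"])
a, b = frozenset({"X2"}), frozenset({"X1", "X3"})
nonnested = not (a <= b) and not (b <= a)
```

The number of candidate classes is $2^{d}-1$, so exhaustive enumeration of this kind is only practical for moderate $d$.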

The upper bound in Proposition 1 is distribution-free in the sense that it is valid for any distribution of $(Y,X)$. Since the distributional properties are ignored, this VC-type upper bound is generally loose. The looseness is further exacerbated by the replacement of the growth function with an upper bound via Sauer’s lemma. Although the distribution of $(Y,X)$ is unknown, its distributional information could be inferred from the sample. As shown in Lozano’s (2000) simulation results for the interval model selection problem, data-driven penalization can track the magnitude of overfitting better than VC-type penalization. Thus, we expect that using data-dependent complexity penalties, instead of the distribution-free complexity penalty $C^{\text{VC}}_{n}(k;\alpha_{0})$, will improve the predictive performance of the UMPR.

3.2 UMPR with a Data-Dependent Penalty

Heuristically, the magnitude of overfitting is bounded by $\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)$, where $S^{\prime}_{n}(f)$ is the empirical utility of $f$ based on the ghost sample $\mathscr{D}^{\prime}_{n}$, in which the observations $(Y^{\prime}_{1},X^{\prime}_{1}),\ldots,(Y^{\prime}_{n},X^{\prime}_{n})$ are distributed as $(Y_{1},X_{1}),\ldots,(Y_{n},X_{n})$ and independent of them.

Footnote 15: It can be shown that $\operatorname*{\mathbb{E}}[\sup_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S(f)\right)]\leq\operatorname*{\mathbb{E}}[\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)]$ by the common symmetrization argument. By McDiarmid’s (1989) inequality, there is a constant $c_{0}>0$ such that for any $\delta\in(0,1)$, $\sup_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S(f)\right)-\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)\leq\sqrt{\ln\{1/\delta\}/(c_{0}n)}$ with probability at least $1-\delta$.

Although the absence of the ghost sample $\mathscr{D}^{\prime}_{n}$ rules out direct estimation of $\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)$, this idea allows us to develop data-dependent complexity penalties. Each of them, like the VC counterpart, is the sum of a technical term involving $\chi_{n}(k;\alpha)$ and an estimate of $\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)$. Different estimates generate different complexity penalties as follows.

1. *Maximal Discrepancy* (MD)

We partition the sample into two nonoverlapping and roughly equal-sized subsamples. For notational simplicity, suppose the sample $\mathscr{D}_{n}$ is partitioned into two subsamples $\mathscr{D}^{(1)}_{n/2}=\{(Y_{i},X_{i})\}_{i=1}^{n/2}$ and $\mathscr{D}^{(2)}_{n/2}=\{(Y_{i},X_{i})\}_{i=n/2+1}^{n}$ , where the sample size $n$ is even. We define the maximal discrepancy complexity penalty to be

[TABLE]

as if $\mathscr{D}^{(1)}_{n/2}$ and $\mathscr{D}^{(2)}_{n/2}$ were the sample and the ghost sample, respectively. The penalization by maximal discrepancy is proposed by Bartlett et al. (2002) in traditional binary classification. We expect the maximal discrepancy complexity penalty to be an appropriate estimate of $\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)$ if the sample size is large.

2. *Simulated Maximal Discrepancy* (SMD)

We can pair up observations between two tentatively pre-specified subsamples, randomly exchange the subsample labels of paired observations, and calculate more maximal discrepancy complexity penalty terms. Repeating the random exchange mechanism $m$ times for the pre-specified subsamples $\{(Y_{2i-1},X_{2i-1})\}_{i=1}^{n/2}$ and $\{(Y_{2i},X_{2i})\}_{i=1}^{n/2}$ yields the simulated maximal discrepancy complexity penalty

[TABLE]

where $\{\sigma^{(j)}\}_{j=1}^{m}=\{(\sigma^{(j)}_{1},\sigma^{(j)}_{2},\ldots,\sigma^{(j)}_{n/2})\}_{j=1}^{m}$ is the collection of i.i.d. Rademacher random vectors (i.e., $\mathbb{P}(\sigma^{(j)}_{i}=1)=\mathbb{P}(\sigma^{(j)}_{i}=-1)=1/2$ ) that are independent of $\mathscr{D}_{n}$ , and $\gamma_{m,n}$ is a deterministic function that satisfies

[TABLE]

We need $\gamma_{m,n}$ to control the extra randomness introduced by the simulated random vectors. Conceptually, we could set $\gamma_{m,n}(M)=24M$ as in the MD penalty if $m=\infty$, the case in which the extra randomness is eliminated.

3. *Rademacher Complexity* (RC)

If the ghost sample $\mathscr{D}^{\prime}_{n}$ were at hand, the pairing and exchange mechanism could be applied to $\mathscr{D}_{n}$ and $\mathscr{D}^{\prime}_{n}$ . Suppose we draw a sequence $(\sigma_{1},\sigma_{2},\ldots,\sigma_{n})$ of i.i.d. Rademacher random variables that are independent of $\mathscr{D}_{n}$ and $\mathscr{D}^{\prime}_{n}$ . Since observations are i.i.d., $\max_{f\in\mathcal{F}_{k}}\left(S_{n}(f)-S^{\prime}_{n}(f)\right)$ is identically distributed as

[TABLE]

which has expectation bounded above by the Rademacher complexity

[TABLE]

We can consider the simulated Rademacher complexity penalty

[TABLE]

where $\{\sigma^{(j)}\}_{j=1}^{m}=\{(\sigma^{(j)}_{1},\sigma^{(j)}_{2},\ldots,\sigma^{(j)}_{n})\}_{j=1}^{m}$ is the collection of i.i.d. Rademacher random vectors that are independent of $\mathscr{D}_{n}$, and $\gamma_{m,n}$ is given in (5). Proposed by Koltchinskii (2001) and Bartlett et al. (2002) in traditional binary classification, the Rademacher complexity and its variants are usually applied to complexity regularization; see, for example, Koltchinskii (2011).

4. *Bootstrap Complexity* (BC)

Following Fromont’s (2007) idea, we can apply Efron’s (1979) bootstrap to the construction of complexity penalty by replacing the Rademacher random variables with the multinomial random weights minus one. Specifically, we treat the bootstrap complexity penalty as

[TABLE]

where $\{W^{(j)}_{n}\}_{j=1}^{m}=\{(W^{(j)}_{n,1},W^{(j)}_{n,2},\ldots,W^{(j)}_{n,n})\}_{j=1}^{m}$ is the collection of i.i.d. multinomial vectors with parameters $n$ and $(1/n,1/n,\ldots,1/n)$ such that $\{W^{(j)}_{n}\}_{j=1}^{m}$ is independent of $\mathscr{D}_{n}$ , and

[TABLE]
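As a rough illustration of the simulated penalties above, the following Python sketch computes the Monte Carlo average of the maximal reweighted empirical utility, using either Rademacher signs (as in RC) or multinomial bootstrap weights minus one (as in BC). It assumes that the per-observation utility has the form $s(y,x,f)=b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$ with $y\in\{-1,1\}$, which is inferred from the expressions used in Appendix C. The function names, the finite set of threshold rules, and the omission of the scaling constants and of the $\gamma_{m,n}$ and $\gamma^{\prime}_{m,n}$ terms are simplifications of this sketch, so it should not be read as the exact penalty formulas.

```python
import numpy as np

def utility_matrix(y, b, c, f_vals):
    """Rows index candidate rules, columns index observations.

    Assumed form: s(y, x, f) = b(x) * (y + 1 - 2 c(x)) * sign(f(x) - c(x)),
    with y in {-1, 1} and sign(0) treated as +1.
    """
    signs = np.where(f_vals >= c, 1.0, -1.0)  # shape (num_rules, n)
    return b * (y + 1.0 - 2.0 * c) * signs

def simulated_complexity(S, m, rng, weights="rademacher"):
    """Average over m draws of max_f (1/n) * sum_i w_i * s(Y_i, X_i, f)."""
    n = S.shape[1]
    draws = []
    for _ in range(m):
        if weights == "rademacher":
            w = rng.choice([-1.0, 1.0], size=n)  # i.i.d. random signs
        else:
            # multinomial bootstrap counts minus one
            w = rng.multinomial(n, np.full(n, 1.0 / n)) - 1.0
        draws.append(np.max(S @ w) / n)
    return float(np.mean(draws))

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-2.5, 2.5, size=n)
y = rng.choice([-1, 1], size=n)
b, c = 20.0, 0.5
# Hypothetical candidate class: nine threshold rules on x.
f_vals = np.stack([(x >= t).astype(float) for t in np.linspace(-2, 2, 9)])
S = utility_matrix(y, b, c, f_vals)
rc_term = simulated_complexity(S, m=10, rng=rng, weights="rademacher")
bc_term = simulated_complexity(S, m=10, rng=rng, weights="multinomial")
```

Replacing the finite candidate set by a rich class such as $\mathcal{P}_{k}$ turns the inner maximum into a nontrivial optimization problem, which this sketch does not address.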

To study the performance of the UMPR $\tilde{f}_{n}$ with each of these data-dependent complexity penalties, we evaluate the difference between the generalized expected utility $\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ and the maximal expected utility $S^{*}$ . The upper bounds on $S^{*}-\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]$ in Theorem 2, resembling the VC counterpart in Theorem 1, have a similar trade-off between the associated expected complexity penalty and the approximation error. Note that the data-dependent complexity penalties are all random, whereas the VC complexity penalty is deterministic.

Theorem 2.

Let $\gamma_{m,n}$ and $\gamma^{\prime}_{m,n}$ be the functions given in (5) and (6), respectively, where $m$ is the number of simulation replications for SMD, RC, and BC penalties. Suppose that (i) the data $\mathscr{D}_{n}=\{(Y_{i},X_{i})\}_{i=1}^{n}$ are i.i.d., (ii) Assumptions (A1)-(A4) hold, (iii) $\mathcal{F}_{k}$ is a VC-subgraph class for each $k$ , and (iv) $\zeta(\alpha_{0})<\infty$ for some $\alpha_{0}$ .

1. If the UMPR $\tilde{f}_{n}$ is constructed based on the penalty $C^{\text{MD}}_{n}$ with tuning parameter $\alpha_{0}$, then we have for any $n\in\mathbb{N}$ and $\varepsilon>0$,

[TABLE]

and

[TABLE]

2. If the UMPR $\tilde{f}_{n}$ is constructed based on the penalty $C^{\text{SMD}}_{n}$ with tuning parameter $\alpha_{0}$, then we have for any $n\in\mathbb{N}$ and $\varepsilon>0$,

[TABLE]

and

[TABLE]

3. If the UMPR $\tilde{f}_{n}$ is constructed based on the penalty $C_{n}^{\text{RC}}$ with tuning parameter $\alpha_{0}$, then we have for any $n\in\mathbb{N}$ and $\varepsilon>0$,

[TABLE]

and

[TABLE]

4. If the UMPR $\tilde{f}_{n}$ is constructed based on the penalty $C_{n}^{\text{BC}}$ with tuning parameter $\alpha_{0}$, then we have for any integer $n\geq 2$ and $\varepsilon>0$,

[TABLE]

and

[TABLE]

We can show that if the ratio $m/n$ is bounded away from zero, then the expected value of each data-dependent complexity penalty in this section shrinks to zero at the rate $\operatorname{O}\left(n^{-1/2}\right)$ , which is slightly faster than the convergence rate of the VC complexity penalty. Hence, if the approximation error $S^{*}-S^{*}_{k}$ is equal to zero for some $k$ , then $S^{*}-\operatorname*{\mathbb{E}}[S(\tilde{f}_{n})]=\operatorname{O}\left(n^{-1/2}\right)$ . Under the same assumptions, we can also demonstrate the universal utility consistency of the UMPR $\tilde{f}_{n}$ with any data-dependent complexity penalty above. These results are summarized in Corollary 3.

Corollary 3.

Suppose that the assumptions of Theorem 2 hold. If, in addition, $m/n\geq 1/\bar{\ell}^{2}$ for some positive integer $\bar{\ell}$ , then there are positive constants $\kappa_{1}$ and $\kappa_{2}$ only depending on $M$ , and $\kappa_{3}$ depending on $(M,\bar{\ell})$ such that for each $k\in\mathbb{N}$ and $n\geq 8$ ,

[TABLE]

Moreover, the UMPR $\tilde{f}_{n}$ constructed based on any aforementioned data-dependent complexity penalty with tuning parameter $\alpha_{0}$ satisfies

[TABLE]

for any distribution of $(Y,X)$ such that $\lim_{k\to\infty}S_{k}^{*}=S^{*}$ .

4 Simulation

To study the finite-sample performance of the UMPR with any complexity penalty in the previous section, we carried out Monte Carlo experiments. The simulation designs are those in Elliott and Lieli (2013). Specifically, we consider two data generating processes:

DGP 1: The covariate $X$ follows the distribution $5\cdot\text{beta}(1,1.3)-2.5$ and $p^{*}(X)=\Lambda(-0.5X+0.2X^{3})$;

DGP 2: Both covariates $X_{1}$ and $X_{2}$ are independent and uniformly distributed on $[-3.5,3.5]$ and $p^{*}(X_{1},X_{2})=\Lambda(Q(1.5X_{1}+1.5X_{2}))$, where $Q(v)=(1.5-0.1v)\exp\{-(0.25v+0.1v^{2}-0.04v^{3})\}$.

In addition, we consider four preferences:

Preference 1: $b(X)=20$ and $c(X)=0.5$;

Preference 2: $b(X)=20$ and $c(X)=0.5+0.025X$;

Preference 3: $b(X_{1},X_{2})=20$ and $c(X_{1},X_{2})=0.75$;

Preference 4: $b(X_{1},X_{2})=20+40\cdot\mathbbm{1}_{[|X_{1}+X_{2}|<1.5]}$ and $c(X_{1},X_{2})=0.75$.
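The two designs and four preferences listed above can be simulated along the following lines. This is a minimal sketch: the function names are hypothetical, and the coding $Y\in\{-1,1\}$ with $\mathbb{P}(Y=1|X)=p^{*}(X)$ is an assumption inferred from the utility expressions in Appendix C.

```python
import numpy as np

def logistic(v):
    """Logistic function Lambda(v)."""
    return 1.0 / (1.0 + np.exp(-v))

def draw_dgp1(n, rng):
    """DGP 1: X = 5 * Beta(1, 1.3) - 2.5 and p*(X) = Lambda(-0.5 X + 0.2 X^3)."""
    x = 5.0 * rng.beta(1.0, 1.3, size=n) - 2.5
    p_star = logistic(-0.5 * x + 0.2 * x**3)
    y = np.where(rng.random(n) < p_star, 1, -1)  # assumed coding Y in {-1, 1}
    return y, x, p_star

def q(v):
    """Q(v) = (1.5 - 0.1 v) * exp{-(0.25 v + 0.1 v^2 - 0.04 v^3)}."""
    return (1.5 - 0.1 * v) * np.exp(-(0.25 * v + 0.1 * v**2 - 0.04 * v**3))

def draw_dgp2(n, rng):
    """DGP 2: X1, X2 independent uniform on [-3.5, 3.5]."""
    x = rng.uniform(-3.5, 3.5, size=(n, 2))
    p_star = logistic(q(1.5 * x[:, 0] + 1.5 * x[:, 1]))
    y = np.where(rng.random(n) < p_star, 1, -1)
    return y, x, p_star

# Each preference maps the covariates to the pair (b(x), c(x)).
preferences = {
    1: (lambda x: np.full(len(x), 20.0), lambda x: np.full(len(x), 0.5)),
    2: (lambda x: np.full(len(x), 20.0), lambda x: 0.5 + 0.025 * x),
    3: (lambda x: np.full(len(x), 20.0), lambda x: np.full(len(x), 0.75)),
    4: (lambda x: 20.0 + 40.0 * (np.abs(x[:, 0] + x[:, 1]) < 1.5),
        lambda x: np.full(len(x), 0.75)),
}
```

Preferences 1 and 2 take the scalar covariate of DGP 1, whereas preferences 3 and 4 take the two-column covariate array of DGP 2.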

The first two preferences are associated with DGP 1, whereas the last two preferences are associated with DGP 2. For DGP 1 together with either preference 1 or 2, not only the cubic ML but also the cubic MU is correctly specified because there are three crossing points between the conditional probability $p^{*}$ and the cutoff function $c$ in the support of $X$. (Footnote 16: By the cubic MU, we mean that the MU optimization is taken over the class $\mathcal{P}_{3}$ of polynomial transformations of order at most 3. Similarly, we refer to the cubic ML as the maximum likelihood estimation with optimization taken over the class $\Lambda(\mathcal{P}_{3})$.)

Although any logit model is misspecified for DGP 2, Elliott and Lieli (2013) demonstrate that the cubic MU is correctly specified in the sense that for all $x\in\mathcal{X}$ , $\text{sign}(f(x))=\text{sign}(p^{*}(x)-c(x))$ for some $f\in\mathcal{P}_{3}$ .

We evaluate different selection methods for the cost-sensitive binary classification. In addition to the proposed UMPR with different complexity penalties, we study the pretest estimator adopted by Elliott and Lieli (2013) and a tenfold cross-validatory estimator in the maximum utility estimation. For the MU, UMPR, pretest and cross-validatory estimators, we specify the hierarchy $\{\mathcal{F}_{k}\}_{k=1}^{\infty}$ of classes as $\mathcal{F}_{k}=\mathcal{P}_{k}$ for $k\in\{1,2\}$ and $\mathcal{F}_{k}=\mathcal{P}_{3}$ for all $k\geq 3$. Moreover, we compare prediction rules based on penalized empirical utility criteria with those based on penalized likelihood criteria. We consider the UMPR for the former criteria, but the AIC and BIC for the latter criteria. For the ML, AIC and BIC, we specify the hierarchy $\{\mathcal{F}_{k}\}_{k=1}^{\infty}$ of classes as $\mathcal{F}_{k}=\Lambda(\mathcal{P}_{k})$ for $k\in\{1,2\}$ and $\mathcal{F}_{k}=\Lambda(\mathcal{P}_{3})$ for all $k\geq 3$. We also compute the tenfold cross-validatory LASSO (i.e., logistic loss with an $\ell_{1}$ penalty) with optimization taken over the class $\Lambda(\mathcal{P}_{3})$ and $\ell_{1}$-norm SVM (i.e., hinge loss with an $\ell_{1}$ penalty) with optimization taken over the class $\mathcal{P}_{3}$. The steps of pretesting and cross-validation in maximum utility estimation are described in Appendix B, whereas the implementation of LASSO and $\ell_{1}$-norm SVM can be found in Efron and Hastie (2016) and Fung and Mangasarian (2004), respectively. (Footnote 17: As suggested for the $\ell_{2}$-norm SVM in Shawe-Taylor and Cristianini (2004), we construct the $\ell_{1}$-norm SVM estimator $\hat{f}_{\text{SVM}}$ based on standardized covariates in the following Monte Carlo experiments. Additionally, we use the logistic transformation $\tilde{f}_{\text{SVM}}\equiv\Lambda(\hat{f}_{\text{SVM}})$ to evaluate its predictive performance. This transformation not only makes $\tilde{f}_{\text{SVM}}$ comparable with other competing estimators in binary classification but also maintains the classification rule of $\hat{f}_{\text{SVM}}$ because $\text{sign}(\hat{f}_{\text{SVM}}(x))=\text{sign}(\tilde{f}_{\text{SVM}}(x)-c(x))$ provided $c(x)=1/2$ for all $x\in\mathcal{X}$.)

To evaluate the performance of a prediction rule $f^{{\dagger}}_{n}$, we compute its relative generalized expected utility

[TABLE]

which can be approximated via simulation because

[TABLE]

where $S_{\ell,j}(f^{{\dagger}}_{n}|\mathscr{D}_{n,j})$ is the $j$ -th out-of-sample empirical utility with size $\ell$ of $f^{{\dagger}}_{n}$ , constructed by the $j$ -th training data $\mathscr{D}_{n,j}$ with size $n$ , $S_{\ell,j}(p^{*})$ is the $j$ -th out-of-sample empirical utility with size $\ell$ of $p^{*}$ , and $\mathcal{S}$ is the number of simulation replications. In the following experiments, we set $n\in\{500,1000\}$ , $\ell=5000$ , and $\mathcal{S}=500$ ; additionally, we take $m=10$ for the SMD, RC, and BC penalties.
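The simulation approximation described above can be sketched as follows. The sketch assumes that the relative generalized expected utility is reported as 100 times the ratio of the average out-of-sample empirical utility of the rule to that of $p^{*}$, and that the utility of an observation is $b(x)[y+1-2c(x)]\,\text{sign}(f(x)-c(x))$ with $y\in\{-1,1\}$; both are assumptions of this illustration. For brevity, the rule is a fixed hypothetical rule that always predicts $+1$ rather than an estimator refitted on each training sample, and only 20 replications are used.

```python
import numpy as np

def empirical_utility(y, b, c, f_vals):
    """Average utility of the rule sign(f - c) on one evaluation sample."""
    signs = np.where(f_vals >= c, 1.0, -1.0)
    return float(np.mean(b * (y + 1.0 - 2.0 * c) * signs))

def relative_utility(rule_utils, oracle_utils):
    """100 * (average utility of the rule) / (average utility of p*)."""
    return 100.0 * float(np.mean(rule_utils)) / float(np.mean(oracle_utils))

rng = np.random.default_rng(0)
num_reps, ell = 20, 5000  # small replication count, for illustration only
b, c = 20.0, 0.5          # Preference 1
rule_utils, oracle_utils = [], []
for _ in range(num_reps):
    # Out-of-sample draw from DGP 1.
    x = 5.0 * rng.beta(1.0, 1.3, size=ell) - 2.5
    p_star = 1.0 / (1.0 + np.exp(0.5 * x - 0.2 * x**3))
    y = np.where(rng.random(ell) < p_star, 1, -1)
    always_one = np.ones(ell)  # hypothetical rule: predict +1 everywhere
    rule_utils.append(empirical_utility(y, b, c, always_one))
    oracle_utils.append(empirical_utility(y, b, c, p_star))
rel = relative_utility(rule_utils, oracle_utils)
```

By construction, a rule that coincides with $p^{*}$ on every evaluation sample attains a relative value of exactly 100 under this convention.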

Table 1 presents the relative generalized expected utility of ML, MU, and UMPR with VC and MD complexity penalties under different designs when $n=500$ and $n=1000$. As expected, a correctly specified ML achieves the largest relative generalized expected utility among these estimators for DGP 1. However, a misspecified ML, compared with MU and UMPR, usually has the worst performance. In addition, for each tuning parameter $\alpha\in\{1,0.5,0.1,0.05\}$, the UMPR with MD penalty outperforms its VC counterpart for DGP 1, but the dominance is not clear for DGP 2. As $\alpha$ decreases (i.e., as the technical term becomes smaller), the UMPR with MD penalty might have slightly larger relative generalized expected utility. However, the UMPR with VC penalty has the same relative generalized expected utility as the linear MU for DGP 2. A caveat is that the correctly specified cubic MU is never selected by the UMPR with VC penalty in any of the 500 simulation replications. This phenomenon probably arises because the distribution-free complexity penalty used to construct the UMPR is too conservative. Using tenfold cross-validated $\hat{\alpha}$ selected from $\{1,0.5,0.1,0.05\}$ does not improve the performance of the UMPR with VC and MD penalties; however, excluding the technical term ($8M\chi_{n}(k;\alpha)$ and $24M\chi_{n}(k;\alpha)$ for VC and MD penalty, respectively) would yield an increase in relative generalized expected utility. These results imply that the UMPR with MD penalty is more adept at selecting the MU estimator with the largest utility than its VC counterpart. Thus, we focus on the UMPR with data-dependent complexity penalties exclusive of the technical term hereafter.

In addition to the comparison between MU and ML, we compare prediction rules based on penalized empirical utility criteria with those based on penalized likelihood criteria. Table 2 reports the relative generalized expected utility of UMPR, AIC, BIC, LASSO, and $\ell_{1}$-norm SVM under different designs when $n=500$ and $n=1000$. We see that the performance of the UMPR relies on the choice of data-dependent complexity penalties. Among these penalties, SMD, RC, and BC might be better than MD in terms of relative generalized expected utility. We also see that the AIC and BIC outperform the UMPR for DGP 1, and this reflects the consistent selection of the cubic ML by the AIC and BIC, a property shown in Sin and White (1996). However, the AIC and BIC, which select a misspecified logit model by the penalized likelihood, are dominated by the UMPR for DGP 2. Furthermore, compared with the AIC and BIC, the LASSO has poorer performance for DGP 1, but almost the same performance for DGP 2. Although it outperforms the LASSO for DGP 1, the SVM has the worst performance for DGP 2. Such bad performance could be attributed to the fact that the SVM aims to recover $\text{sign}(p^{*}(x)-1/2)$ rather than $\text{sign}(p^{*}(x)-c(x))$, but the cutoff function $c(x)$ is markedly different from $1/2$ for DGP 2. More importantly, when the number of in-sample observations increases from $n=500$ to $n=1000$, the relative generalized expected utility of UMPR increases for all designs. This phenomenon is guaranteed by Theorem 2 and Corollary 3 because the approximation error $S^{*}-S^{*}_{3}$ is equal to zero. In contrast, for DGP 2 in which any ML is misspecified, a larger sample size does not improve the relative generalized expected utility of AIC, BIC, LASSO, and SVM. These results demonstrate that the UMPR inherits the robustness of MU estimation, a feature that selection methods based on penalized likelihood criteria do not possess in general.

Finally, we compare the proposed UMPR with two pretest estimators, including a specific-to-general approach and a general-to-specific approach, and the tenfold cross-validatory estimator in maximum utility estimation. Table 3 provides the relative generalized expected utility and the percentage of models selected out of 500 simulation replications when $n=500$ and $n=1000$. Differences in the selection frequencies for the pretesting are shown across the left and right panels because Elliott and Lieli’s (2013) test statistic depends on the preference specifications. In terms of selecting the correctly specified cubic MU, the UMPR with VC penalty, either inclusive or exclusive of a technical term, performs worst under all designs. As can be seen, the cross-validatory estimator, in comparison with the UMPR, attains higher percentages of selecting the cubic MU under all designs. Thus, cross-validation might be preferable if we attempt to select the model correctly specified in the maximum utility estimation. However, if the goal is to capture the largest expected utility, we prefer the UMPR with data-dependent penalties to the pretest and cross-validatory estimators because the proposed penalty-based prediction rules perform better than the other estimators in terms of the relative generalized expected utility, as suggested by the experimental evidence.

5 Conclusion

Maximum utility estimation can be viewed as binary classification with a decision-based utility function. Despite its possible improvement in utility over traditional maximum likelihood methods, maximum utility estimation inherits the in-sample overfitting of perceptron learning.

To alleviate the in-sample overfitting, we adopt the structural risk minimization approach to construct a utility-maximizing prediction rule. For complexity penalization, we consider the distribution-free VC penalty and four data-dependent penalties (MD, SMD, RC, and BC). For each penalty, we show that the difference between the maximal expected utility and the generalized expected utility of the utility-maximizing prediction rule is bounded. The upper bounds are close to zero for a large sample if the approximation error is equal to zero for some pre-specified classes of functions. In general, we prefer the simulated complexity penalties in terms of predictive performance, as suggested by the simulation results. These simulation results also show that the utility-maximizing prediction rule with an appropriate data-dependent complexity penalty has better predictive performance than the pretest and cross-validatory estimators; more importantly, it outperforms the AIC, BIC, and LASSO if the conditional probability of the binary outcome is misspecified, and the $\ell_{1}$ -norm SVM if the cutoff function considerably deviates from $1/2$ . The utility-maximizing prediction rule is thus important for the decision-making based on binary prediction.

Appendix A Selection of $\alpha_{0}$

Given a pre-specified finite set $\mathscr{A}$ in which every element $\alpha$ satisfies $\zeta(\alpha)<\infty$, we can select $\alpha_{0}=\hat{\alpha}\in\mathscr{A}$ by the $T$-fold cross-validation method. (Footnote 18: Although the optimal choice of $T$ is an open theoretical question, common choices of $T$ are 5 or 10, as suggested by Hastie et al. (2009).) We randomly partition the data $\mathscr{D}_{n}$ into $T$ roughly equal-sized sets. Let $\tau:\{1,2,\ldots,n\}\to\{1,2,\ldots,T\}$ be the indexing function such that the observation $(Y_{i},X_{i})$ is in the validation set $\tau(i)$. We write $\mathscr{D}^{(-t)}_{n}$ for the data $\mathscr{D}_{n}$ from which the validation set $t$ is removed, and $n_{t}$ for the sample size of $\mathscr{D}^{(-t)}_{n}$. Let $S^{(-t)}_{n}(f)$ be the empirical utility of $f$ calculated based on $\mathscr{D}^{(-t)}_{n}$. The $T$-fold cross-validation method in our framework is implemented as follows.

(1) For each $t\in\{1,2,\ldots,T\}$ and $\alpha\in\mathscr{A}$, we calculate the UMPR with tuning parameter $\alpha$ based on $\mathscr{D}^{(-t)}_{n}$ by $\tilde{f}^{(-t)}_{n}(\alpha)=\hat{f}^{(-t)}_{\hat{k}^{(-t)}_{n}(\alpha)}$, where

[TABLE]

(2) The cross-validated tuning parameter is defined as

[TABLE]

where

[TABLE]

(3) We calculate the UMPR $\tilde{f}_{n}(\hat{\alpha})$ with cross-validated $\hat{\alpha}$ based on the whole data $\mathscr{D}_{n}$.
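The three steps above can be sketched generically as follows. The callables `fit` and `utility` are hypothetical placeholders: `fit(candidate, y, x)` is assumed to return a prediction rule estimated on the training folds (for example, the UMPR with tuning parameter `candidate`), and `utility(rule, y, x)` is assumed to return the average empirical utility of that rule on the validation fold. The cross-validation criterion is taken here to be the average validation utility over all $n$ observations, which is an assumption of this sketch.

```python
import numpy as np

def t_fold_select(candidates, fit, utility, y, x, T, rng):
    """Return the candidate with the largest cross-validated utility and all scores."""
    n = len(y)
    tau = rng.permutation(n) % T  # random fold label for each observation
    scores = []
    for cand in candidates:
        total = 0.0
        for t in range(T):
            held_out = tau == t
            # Step (1): fit on the data with validation set t removed.
            rule = fit(cand, y[~held_out], x[~held_out])
            # Accumulate the validation utility, weighted by the fold size.
            total += utility(rule, y[held_out], x[held_out]) * held_out.sum()
        scores.append(total / n)
    # Step (2): pick the candidate with the largest criterion value.
    best = candidates[int(np.argmax(scores))]
    return best, scores
```

Step (3) then amounts to calling `fit(best, y, x)` on the whole sample. The same skeleton applies when the candidates are model indices $k$ rather than tuning parameters $\alpha$.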

Similarly, we can show that for any $n\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

and

[TABLE]

Appendix B Pretesting and Cross-Validation

B.1 Pretesting

We consider the null hypothesis $H^{(k)}_{0}:S^{*}_{(k-1)}=S^{*}_{k}$ against the alternative hypothesis $H^{(k)}_{1}:S^{*}_{(k-1)}<S^{*}_{k}$ for $k\in\{2,3\}$ . Elliott and Lieli (2013) propose a general-to-specific pretest estimator based on the test statistic developed in their Proposition 4. They suggest selecting the model $\mathcal{F}_{\hat{k}(\text{G}\to\text{S})}$ , where

[TABLE]

Similarly, we can apply a specific-to-general approach to selecting the model $\mathcal{F}_{\hat{k}(\text{S}\to\text{G})}$ , where

[TABLE]

For these two approaches, we conduct Elliott and Lieli’s test at the 5% level, using auxiliary i.i.d. random variables that follow a Bernoulli(0.75) distribution.

B.2 Cross-Validation

We randomly partition the data $\mathscr{D}_{n}$ into $T$ roughly equal-sized sets. Let $\tau:\{1,2,\ldots,n\}\to\{1,2,\ldots,T\}$ be the indexing function such that the observation $(Y_{i},X_{i})$ is in the validation set $\tau(i)$ . We write $\mathscr{D}^{(-t)}_{n}$ for the data $\mathscr{D}_{n}$ from which the validation set $t$ is removed. The $T$ -fold cross-validation method in the maximum utility framework can be implemented as follows.

(1) We consider an integer $K$. For each $k\in\{1,2,\ldots,K\}$ and $t\in\{1,2,\ldots,T\}$, we calculate the MU estimator based on $\mathscr{D}^{(-t)}_{n}$ by

[TABLE]

where $S^{(-t)}_{n}(f)$ is the empirical utility calculated by $f$ and $\mathscr{D}^{(-t)}_{n}$; that is,

[TABLE]

(2) The cross-validated value of $k$ is defined as

[TABLE]

where

[TABLE]

(3) The cross-validated MU estimator is the MU estimator selected from $\mathcal{F}_{\hat{k}_{n}}$ based on $\mathscr{D}_{n}$; specifically,

[TABLE]

Appendix C Technical Proofs

C.1 Proof of Proposition 1

Proof.

Since the mapping

[TABLE]

satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation $c_{i}=8M/n$ for each $i\in\{1,\ldots,n\}$ , we have

[TABLE]

by McDiarmid’s (1989) inequality.

It is now sufficient to show that

[TABLE]

Let $\mathscr{D}^{\prime}_{n}$ be the ghost sample, in which the observations $(Y^{\prime}_{1},X^{\prime}_{1}),\ldots,(Y^{\prime}_{n},X^{\prime}_{n})$ are distributed as $(Y_{1},X_{1}),\ldots,(Y_{n},X_{n})$ and independent of them. We write $S^{\prime}_{n}(f)$ for the empirical utility of the prediction rule $f$ based on the ghost sample $\mathscr{D}^{\prime}_{n}$ . Let $\{\sigma_{i}\}_{i=1}^{n}$ be a sequence of i.i.d. Rademacher random variables that are independent of $\mathscr{D}_{n}$ ; that is, $\mathbb{P}(\sigma_{i}=1)=\mathbb{P}(\sigma_{i}=-1)=1/2$ . The common symmetrization argument shows that

[TABLE]

Let $\psi_{i}\equiv b(X_{i})[Y_{i}+1-2c(X_{i})]$ for each $i\in\{1,\ldots,n\}$ . Lemma 26.9 of Shalev-Shwartz and Ben-David (2014) implies that

[TABLE]

Applying Lemma 5.2 of Massart (2000) yields

[TABLE]

It follows that

[TABLE]

Combining Inequalities (C.1) and (C.2) completes the proof. ∎

C.2 Proof of Corollary 1

Proof.

Since the growth function $\Pi_{k,c}$ is of polynomial order for each $k\in\mathbb{N}$ , there exists an integer $n^{*}$ such that

[TABLE]

for all $n\geq n^{*}$ . Following the argument in Proposition 1 mutatis mutandis, we have

[TABLE]

Hence, for all $n\geq n^{*}$ ,

[TABLE]

∎

C.3 Proof of Theorem 1

Proof.

Part 1

By construction, we have

[TABLE]

for each $j\in\mathbb{N}$ . So, for any $n\in\mathbb{N}$ and $\varepsilon>0$ ,

[TABLE]

It follows by Inequality (3.1) that

[TABLE]

Part 2

By Part 1 and Lemma 1, we have

[TABLE]

In addition, for each $k\in\mathbb{N}$ ,

[TABLE]

because $\operatorname*{\mathbb{E}}[S_{n}(f_{k}^{*})]=S_{k}^{*}$ and $f_{k}^{*}\in\arg\max_{f\in\mathcal{F}_{k}}S(f)$ . So, for each $k\in\mathbb{N}$ ,

[TABLE]

It follows that

[TABLE]

∎

C.4 Proof of Corollary 2

Proof.

Fix an $\varepsilon>0$ . Choose an integer $k_{0}=k_{0}(\varepsilon)$ such that $S^{*}-\sup_{f\in\mathcal{F}_{k}}S(f)<\varepsilon/3$ for all $k\geq k_{0}$ . In addition, choose an integer $n_{0}=n_{0}(\varepsilon,k_{0})$ such that $C^{\text{VC}}_{n}(k_{0};\alpha_{0})<\varepsilon/6$ for all $n\geq n_{0}$ . For each $n\in\mathbb{N}$ ,

[TABLE]

It follows from Theorem 1 that the second term in the right hand side is bounded above by $A_{1}\exp{\left\{-B_{1}n\varepsilon^{2}\right\}}$ for some positive constants $A_{1}$ and $B_{1}$ . By Corollary 1, there exist an integer $n_{1}=n_{1}(\varepsilon,k_{0})$ and two constants $A_{2}$ and $B_{2}$ such that

[TABLE]

for all $n\geq n_{1}$ . Hence, for any $n\geq n_{2}\equiv\max\{n_{0},n_{1}\}$ , we have

[TABLE]

Therefore, we have

[TABLE]

Applying the Borel-Cantelli lemma yields the statement. ∎

C.5 Proof of Proposition 2

Proof.

We first study $S^{*}-S(f)$ . Let $b_{1}(x)=u_{1,1}(x)-u_{-1,1}(x)$ and $b_{-1}(x)=u_{-1,-1}(x)-u_{1,-1}(x)$ for all $x\in\mathcal{X}$ . For any $f$ , we have that with probability one,

[TABLE]

It follows from (C.5) that with probability one,

[TABLE]

Note that for any $f$ , we have

[TABLE]

Combining (C.5) and (C.5) yields

[TABLE]

for any $f$ . It immediately follows that $S^{*}=S(p^{*})$ .

Next, we calculate the maximal expected utility $S^{*}$ . The derivation in (C.5) implies that with probability one,

[TABLE]

After rearrangement, we obtain

[TABLE]

Taking expectation on both sides yields $S^{*}=S(p^{*})=2\operatorname*{\mathbb{E}}\left[b(X)|p^{*}(X)-c(X)|\right]$ by (C.5).

Finally, note that $|p^{*}(X)-c(X)|\leq|p^{*}(X)-f(X)|$ whenever $\mathbbm{1}_{[p^{*}(X)\geq c(X)]}\neq\mathbbm{1}_{[f(X)\geq c(X)]}$ . It follows that

[TABLE]

Hence, we obtain $S^{*}-S(f)\leq 16M\sup_{x\in\mathcal{X}}|p^{*}(x)-f(x)|$ by Assumption (A3). ∎

C.6 Proof of Theorem 2

Proof.

Part 1

We write $S^{\prime}_{n}(f)$ for the empirical utility of the prediction rule $f$ based on the ghost sample $\mathscr{D}^{\prime}_{n}$ , in which the observations $(Y^{\prime}_{1},X^{\prime}_{1}),\ldots,(Y^{\prime}_{n},X^{\prime}_{n})$ are distributed as $(Y_{1},X_{1}),\ldots,(Y_{n},X_{n})$ and independent of them. For ease of notation, let

[TABLE]

As in the proof of Theorem 1, the desired results follow from the exponential tail inequality

[TABLE]

where $R^{\text{MD}}_{n,k}\equiv S_{n}(\hat{f}_{k})-\max_{f\in\mathcal{F}_{k}}(S^{(1)}_{n}(f)-S^{(2)}_{n}(f))$ . To establish this tail inequality, we note that if

[TABLE]

then

[TABLE]

It follows from McDiarmid’s (1989) inequality that the latter probability is bounded above by $\exp{\left\{-2n\varepsilon^{2}/(24M)^{2}\right\}}=\exp{\left\{-n\varepsilon^{2}/(288M^{2})\right\}}$ because the mapping

[TABLE]

satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation $c_{i}=24M/n$ for each $i\in\{1,\ldots,n\}$ .
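The constant in this bound can be checked by a short calculation. Assuming the standard one-sided form of McDiarmid's inequality, $\mathbb{P}(Z-\mathbb{E}[Z]\geq\varepsilon)\leq\exp\{-2\varepsilon^{2}/\sum_{i=1}^{n}c_{i}^{2}\}$, where $Z$ is shorthand introduced here for the mapped random variable, the choice $c_{i}=24M/n$ gives:

```latex
\sum_{i=1}^{n} c_{i}^{2}
  = n\left(\frac{24M}{n}\right)^{2}
  = \frac{576M^{2}}{n},
\qquad
\exp\left\{-\frac{2\varepsilon^{2}}{\sum_{i=1}^{n} c_{i}^{2}}\right\}
  = \exp\left\{-\frac{2n\varepsilon^{2}}{(24M)^{2}}\right\}
  = \exp\left\{-\frac{n\varepsilon^{2}}{288M^{2}}\right\}.
```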

It remains to prove Inequality (C.6). Since $\{(Y^{\prime}_{i},X^{\prime}_{i})\}_{i=1}^{n/2}$ , $\{(Y^{\prime}_{i},X^{\prime}_{i})\}_{i=n/2+1}^{n}$ , $\{(Y_{i},X_{i})\}_{i=1}^{n/2}$ , and $\{(Y_{i},X_{i})\}_{i=n/2+1}^{n}$ are independent and identically distributed, the common symmetrization argument shows that

[TABLE]

Part 2

For ease of notation, let $\sigma\equiv\{\sigma^{(j)}\}_{j=1}^{m}=\{(\sigma^{(j)}_{1},\sigma^{(j)}_{2},\ldots,\sigma^{(j)}_{n/2})\}_{j=1}^{m}$ ,

[TABLE]

and $R^{\text{SMD}}_{n,k}\equiv S_{n}(\hat{f}_{k})-Q^{\text{SMD}}_{n,k}(\sigma;\mathscr{D}_{n})$ . For every $m,n\in\mathbb{N}$ , let

[TABLE]

The proof is similar to that of Part 1 in the sense that our goal is to establish an appropriate exponential tail inequality of $R^{\text{SMD}}_{n,k}-S(\hat{f}_{k})$ . The additional trick is to deal with the randomness arising from simulated Rademacher random vectors. To disentangle such randomness from the randomness of data $\mathscr{D}_{n}$ , we consider the inequality

[TABLE]

for any $\eta\in(0,1)$ . Given $\mathscr{D}_{n}$ , the mapping $\{\sigma^{(j)}\}_{j=1}^{m}\mapsto Q^{\text{SMD}}_{n,k}(\sigma;\mathscr{D}_{n})$ satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation $c_{j}=16M/m$ for each $j\in\{1,\ldots,m\}$ . It follows from McDiarmid’s (1989) inequality that

[TABLE]

with probability one. Taking expectation with respect to $\mathscr{D}_{n}$ on both sides yields

[TABLE]

In addition, since the observations in $\mathscr{D}_{n}$ and $\mathscr{D}^{\prime}_{n}$ are i.i.d., the common symmetrization argument shows that

[TABLE]

We apply McDiarmid’s (1989) inequality again and obtain

[TABLE]

because the mapping

[TABLE]

satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation $c_{i}=24M/n$ for each $i\in\{1,\ldots,n\}$ . Combining Inequalities (C.7)-(C.6) and setting $\eta=\eta_{m,n}$ , we obtain

[TABLE]

The desired results follow from the exponential tail probability above and similar arguments used in Theorem 1.

Part 3

Let $\sigma\equiv\{\sigma^{(j)}\}_{j=1}^{m}=\{(\sigma^{(j)}_{1},\sigma^{(j)}_{2},\ldots,\sigma^{(j)}_{n})\}_{j=1}^{m}$ and $R^{\text{RC}}_{n,k}\equiv S_{n}(\hat{f}_{k})-Q^{\text{RC}}_{n,k}(\sigma;\mathscr{D}_{n})$ , where

[TABLE]

Following the argument in Part 2 mutatis mutandis, we have for any $\eta\in(0,1)$ ,

[TABLE]

and

[TABLE]

Combining these two inequalities with $\eta=\eta_{m,n}$ yields

[TABLE]

The desired results follow from the exponential tail probability above and similar arguments used in Theorem 1.

Part 4

For ease of notation, let $W\equiv\{W^{(j)}_{n}\}_{j=1}^{m}=\{(W^{(j)}_{n,1},W^{(j)}_{n,2},\ldots,W^{(j)}_{n,n})\}_{j=1}^{m}$ ,

[TABLE]

and $R^{\text{BC}}_{n,k}\equiv S_{n}(\hat{f}_{k})-Q^{\text{BC}}_{n,k}(W;\mathscr{D}_{n})$ .

As in the proof of Part 2, the desired results follow from the exponential tail inequality

[TABLE]

To establish this tail probability, we have

[TABLE]

for any $\eta\in(0,1)$ . Given $\mathscr{D}_{n}$ , the mapping $\{W_{n}^{(j)}\}_{j=1}^{m}\mapsto Q^{\text{BC}}_{n,k}(W;\mathscr{D}_{n})$ satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation $c_{j}=(8M/m)(n/(n-1))^{n}$ for each $j\in\{1,\ldots,m\}$ . It follows from McDiarmid’s (1989) inequality that

[TABLE]

with probability one. Taking expectation with respect to $\mathscr{D}_{n}$ on both sides yields

[TABLE]

because $((n-1)/n)^{n}\geq 1/4$ for all $n\geq 2$ . If we have

[TABLE]

then we can apply McDiarmid’s (1989) inequality again and obtain

[TABLE]

because the mapping

[TABLE]

satisfies the bounded differences property in Section 6.1 of Boucheron et al. (2013) with their notation

[TABLE]

for each $i\in\{1,\ldots,n\}$ . By Lemma 2,

[TABLE]

It follows that

[TABLE]

Combining Inequalities (C.11)-(C.14) and setting

[TABLE]

yield Inequality (C.10).

Now, it suffices to show Inequality (C.13). Note that

[TABLE]

Applying Jensen’s inequality to the numerator, we have

[TABLE]

Let $\mathscr{D}_{n,\geq 1}$ be the largest subset of $\mathscr{D}_{n}$ such that each observation $(Y_{i},X_{i})$ in $\mathscr{D}_{n,\geq 1}$ has the concomitant $W_{n,i}$ greater than or equal to $1$ . Note that

[TABLE]

and

[TABLE]

It follows that

[TABLE]

Combining (C.15) and (C.16) yields

[TABLE]

Lemma 2 shows that $\operatorname*{\mathbb{E}}\left[(W_{n,1}-1)_{+}\right]=\left(1-1/n\right)^{n}$ . Therefore, we obtain

[TABLE]

∎

C.7 Proof of Corollary 3

Proof.

We first consider the MD, SMD, and RC penalties. Since

[TABLE]

and

[TABLE]

it suffices to find an appropriate upper bound on $\operatorname*{\mathbb{E}}\left[\max_{f\in\mathcal{F}_{k}}\frac{1}{n}\sum_{i=1}^{n/2}\sigma_{i}s(Y_{i},X_{i},f)\right]$ . Let $Z_{i}=\sigma_{i}b(X_{i})[Y_{i}+1-2c(X_{i})]$ for each $i$ . We have $\operatorname*{\mathbb{E}}[Z_{i}|\mathscr{D}_{n}]=0$ for each $i$ and $\operatorname*{\mathbb{E}}[|Z_{1}|^{\ell}|\mathscr{D}_{n}]\leq(\ell!/2)(4M)^{\ell}$ for each $\ell\geq 2$ . Notice that

[TABLE]

where $\mathbb{A}_{k,c}(\mathscr{D}_{n})\equiv\{(\mathbbm{1}_{B}(X_{1}),\ldots,\mathbbm{1}_{B}(X_{n})):B\in\mathcal{B}_{k,c}\}$ and $\mathcal{B}_{k,c}\equiv\{\{x\in\mathcal{X}:f(x)-c(x)\geq 0\}:f\in\mathcal{F}_{k}\}$ . Note that the VC dimension of $\mathcal{B}_{k,c}$ is $V_{k,c}$ . It follows by the chaining technique in Lemma 3 and Theorem 4 of Fromont (2007) that there are positive constants $\bar{\kappa}_{1}$ and $\bar{\kappa}_{2}$ , only depending on $M$ , such that

[TABLE]

for each $k\in\mathbb{N}$ and $n\geq 8$ . Hence, taking expectation on both sides, we obtain

[TABLE]

Next, we consider the BC penalty. As in the proof of Fromont (2007), we apply Poissonization to remove the dependence of $(W_{n,1},\ldots,W_{n,n})$ . Let $\{U_{i}\}_{i=1}^{n}$ be a sequence of i.i.d. random variables independent of $\mathscr{D}_{n}$ and uniformly distributed on $(0,1)$ such that

[TABLE]

Let $N$ be the Poisson random variable with parameter $n$ that is independent of $\mathscr{D}_{n}$ and $\{U_{i}\}_{i=1}^{n}$ . For each $i\in\{1,\ldots,n\}$ , we define

[TABLE]

It can be shown that $\{N_{i}\}_{i=1}^{n}$ is a sequence of i.i.d. random variables independent of $\mathscr{D}_{n}$ and each $N_{i}$ follows a Poisson distribution with parameter 1; additionally,

[TABLE]

Let $\tilde{Z}_{i}=(N_{i}-1)b(X_{i})[Y_{i}+1-2c(X_{i})]$ for each $i$ . As in the proof of Fromont (2007), we have $\operatorname*{\mathbb{E}}[\tilde{Z}_{i}|\mathscr{D}_{n}]=0$ for each $i$ and $\operatorname*{\mathbb{E}}[|\tilde{Z}_{1}|^{\ell}|\mathscr{D}_{n}]\leq 1.4\cdot(\ell!/2)(4M)^{\ell}$ for each $\ell\geq 2$ . Then there are positive constants $\tilde{\kappa}_{1}$ and $\tilde{\kappa}_{2}$ , only depending on $M$ , such that for each $k\in\mathbb{N}$ and $n\geq 8$ ,

[TABLE]

by the chaining technique in Lemma 3 and Theorem 4 of Fromont (2007). Therefore, taking expectation on both sides yields

[TABLE]

for each $k\in\mathbb{N}$ and $n\geq 8$ . Note that the technical term in BC penalty satisfies

[TABLE]

Finally, the proof is completed by showing the universal utility consistency. We have established that for each $k\in\mathbb{N}$ , $\operatorname*{\mathbb{E}}\left[C^{\text{MD}}_{n}(k;\alpha_{0})\right]=\operatorname{O}\left(n^{-1/2}\right)$ , $\operatorname*{\mathbb{E}}\left[C^{\text{SMD}}_{n}(k;\alpha_{0},m)|\mathscr{D}_{n}\right]=\operatorname{O}\left(n^{-1/2}\right)$ , $\operatorname*{\mathbb{E}}\left[C^{\text{RC}}_{n}(k;\alpha_{0},m)|\mathscr{D}_{n}\right]=\operatorname{O}\left(n^{-1/2}\right)$ , and $\operatorname*{\mathbb{E}}\left[C_{n}^{\text{BC}}(k;\alpha_{0},m)|\mathscr{D}_{n}\right]=\operatorname{O}\left(n^{-1/2}\right)$ with probability one. Hence, a simple modification of the proof of Corollary 2 yields the results.

∎

Lemma 1 below is a slightly revised version of Problem 12.1 of Devroye et al. (1996).

Lemma 1.

If a random variable $Z$ satisfies

[TABLE]

for all $\epsilon>0$ and some positive numbers $c_{1}$ and $c_{2}$ , then

[TABLE]

Proof.

For any $v>0$ , we have

[TABLE]

Taking $v=\frac{\log{\{c_{1}\}}}{c_{2}}$ yields

[TABLE]

∎

Lemma 2.

Suppose $(W_{n,1},W_{n,2},\ldots,W_{n,n})$ is a multinomial vector with parameters $n$ and $(1/n,1/n,\ldots,1/n)$ . Then for each $i\in\{1,2,\ldots,n\}$ ,

[TABLE]

Proof.

For each $i$ , $\operatorname*{\mathbb{E}}\left[(W_{n,i}-1)_{+}\right]=\operatorname*{\mathbb{E}}\left[W_{n,i}-1\right]+\operatorname*{\mathbb{E}}\left[(W_{n,i}-1)_{-}\right]=\operatorname*{\mathbb{E}}\left[(W_{n,i}-1)_{-}\right]$ because $\operatorname*{\mathbb{E}}\left[W_{n,i}\right]=1$ . Hence, we have

[TABLE]

∎
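The identity $\operatorname*{\mathbb{E}}\left[(W_{n,i}-1)_{+}\right]=(1-1/n)^{n}$ used in the proof of Theorem 2, together with the bound $((n-1)/n)^{n}\geq 1/4$ for $n\geq 2$, can be verified numerically. The sketch below relies on the fact that each coordinate of the multinomial vector is marginally Binomial$(n,1/n)$; the function names are hypothetical.

```python
from math import comb

def expected_positive_part(n):
    """Exact E[(W - 1)_+] for W ~ Binomial(n, 1/n)."""
    p = 1.0 / n
    return sum(
        (w - 1) * comb(n, w) * p**w * (1.0 - p) ** (n - w)
        for w in range(2, n + 1)
    )

def probability_of_zero(n):
    """P(W = 0) = (1 - 1/n)^n, which equals E[(W - 1)_-]."""
    return (1.0 - 1.0 / n) ** n

# Pairs (E[(W - 1)_+], (1 - 1/n)^n) for n = 2, ..., 30.
checks = {n: (expected_positive_part(n), probability_of_zero(n)) for n in range(2, 31)}
```

The two entries of each pair agree up to floating-point error, and the second entry is smallest at $n=2$, where it equals $1/4$.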

Bibliography50

Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle, in: Petrov, B.N., Csaki, F. (Eds.), Second International Symposium on Information Theory, pp. 267–281.
Anthony, M., Bartlett, P.L., 1999. Neural Network Learning: Theoretical Foundations. Cambridge University Press. doi: 10.1017/CBO9780511624216.
Arlot, S., Celisse, A., 2010. A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79. doi: 10.1214/09-SS054.
Athey, S., Imbens, G.W., 2019. Machine learning methods that economists should know about. Annual Review of Economics 11, 685–725. doi: 10.1146/annurev-economics-080217-053433.
Bagby, T., Bos, L., Levenberg, N., 2002. Multivariate simultaneous approximation. Constructive Approximation 18, 569–577. doi: 10.1007/s00365-001-0024-6.
Barberis, N., Xiong, W., 2012. Realization utility. Journal of Financial Economics 104, 251–271. doi: 10.1016/j.jfineco.2011.10.005.
Bartlett, P.L., Boucheron, S., Lugosi, G., 2002. Model selection and error estimation. Machine Learning 48, 85–113. doi: 10.1023/A:1013999503812.
Boucheron, S., Lugosi, G., Massart, P., 2013. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press. doi: 10.1093/acprof:oso/9780199535255.001.0001.
