Semi-Supervised Ordinal Regression Based on Empirical Risk Minimization

Taira Tsuchiya; Nontawat Charoenphakdee; Issei Sato; Masashi Sugiyama

arXiv:1901.11351·cs.LG·June 11, 2021


TL;DR

This paper introduces a flexible semi-supervised ordinal regression framework based on empirical risk minimization, capable of optimizing various metrics with theoretical guarantees and no restrictive assumptions on unlabeled data.

Contribution

It proposes a novel, general framework for semi-supervised ordinal regression that supports multiple metrics and model choices and provides theoretical consistency guarantees.

Findings

01 Framework effectively optimizes multiple evaluation metrics.

02 Theoretical analysis confirms estimator consistency.

03 Experimental results demonstrate practical usefulness.


Tables

Table 1: Notations of losses that are used in this paper.

Task loss	$\mathcal{L}(g(\bm{x}),y)$	Task surrogate loss	$\psi(\bm{\alpha}(\bm{x}),y)$
Task risk	$\mathcal{R}(g)$	Task surrogate risk	$\mathcal{S}(g)$
$0\text{-}1$ loss	$\mathbbm{1}[m<0]$	Binary surrogate loss	$\ell(z)$

Table 3: The mean and standard error of the mean absolute error, mean zero-one error, and mean squared error on small benchmark datasets. The experiments were conducted 20 times. Outperforming methods for each evaluation metric and dataset are highlighted in boldface using the Wilcoxon signed-rank test with a significance level of 5%.

	Mean Absolute Error			Mean Zero-one Error			Mean Squared Error
	SV	SEMI1	SEMI2	SV	SEMI1	SEMI2	SV	SEMI1	SEMI2
abalone	0.381 (0.02)	0.370 (0.04)	0.356 (0.03)	0.372 (0.02)	0.368 (0.03)	0.353 (0.03)	0.447 (0.06)	0.392 (0.05)	0.365 (0.03)
bank1-5	0.312 (0.06)	0.280 (0.06)	0.285 (0.05)	0.309 (0.06)	0.277 (0.06)	0.283 (0.05)	0.353 (0.09)	0.281 (0.05)	0.290 (0.06)
bank2-5	0.518 (0.04)	0.518 (0.04)	0.616 (0.04)	0.484 (0.03)	0.475 (0.03)	0.552 (0.03)	1.110 (0.08)	0.701 (0.07)	0.710 (0.09)
census1-5	0.453 (0.07)	0.414 (0.04)	0.416 (0.04)	0.425 (0.06)	0.400 (0.03)	0.397 (0.03)	0.544 (0.07)	0.468 (0.04)	0.431 (0.03)
census2-5	0.534 (0.07)	0.496 (0.06)	0.500 (0.05)	0.488 (0.06)	0.453 (0.05)	0.457 (0.04)	0.755 (0.11)	0.627 (0.10)	0.530 (0.06)
computer1-5	0.308 (0.05)	0.295 (0.05)	0.302 (0.04)	0.299 (0.05)	0.291 (0.04)	0.300 (0.04)	0.461 (0.10)	0.403 (0.06)	0.329 (0.06)
computer2-5	0.312 (0.04)	0.286 (0.03)	0.317 (0.04)	0.306 (0.04)	0.283 (0.03)	0.310 (0.04)	0.511 (0.08)	0.432 (0.06)	0.356 (0.05)
fireman	0.556 (0.08)	0.556 (0.08)	0.554 (0.07)	0.471 (0.06)	0.464 (0.06)	0.461 (0.04)	0.637 (0.12)	0.626 (0.11)	0.591 (0.08)
kinematics	0.635 (0.06)	0.606 (0.05)	0.608 (0.06)	0.536 (0.03)	0.526 (0.04)	0.528 (0.03)	0.779 (0.10)	0.702 (0.07)	0.743 (0.12)
lev	0.339 (0.07)	0.308 (0.04)	0.301 (0.04)	0.335 (0.06)	0.308 (0.03)	0.299 (0.04)	0.354 (0.07)	0.345 (0.13)	0.304 (0.05)
swd	0.701 (0.05)	0.700 (0.12)	0.620 (0.05)	0.579 (0.03)	0.566 (0.03)	0.537 (0.04)	0.873 (0.12)	0.765 (0.10)	0.683 (0.08)
toy	0.266 (0.07)	0.215 (0.05)	0.226 (0.05)	0.235 (0.05)	0.212 (0.04)	0.215 (0.04)	0.361 (0.10)	0.250 (0.08)	0.293 (0.08)

Table 4: The mean and standard error of the mean absolute error on benchmark datasets. Outperforming methods in each dataset are highlighted in boldface using the Wilcoxon signed-rank test with a significance level of 5%. The experiments were conducted 20 times.

	SSMOR	TOR	SV-Ker	SEMI1-Ker	SEMI2-Ker	SV-NN	SEMI1-NN	SEMI2-NN
bank1-5	0.935 (0.16)	0.250 (0.01)	0.249 (0.01)	0.294 (0.05)	0.339 (0.06)	0.267 (0.01)	0.251 (0.01)	0.248 (0.01)
bank2-5	1.184 (0.07)	0.907 (0.03)	0.883 (0.02)	0.887 (0.03)	0.915 (0.03)	0.834 (0.02)	0.839 (0.02)	0.828 (0.02)
census1-5	1.190 (0.11)	0.803 (0.03)	0.785 (0.03)	0.792 (0.02)	0.803 (0.02)	0.785 (0.03)	0.686 (0.02)	0.697 (0.02)
census2-5	1.206 (0.14)	0.784 (0.02)	0.768 (0.02)	0.779 (0.02)	0.798 (0.03)	0.745 (0.02)	0.693 (0.02)	0.689 (0.02)
computer1-5	0.671 (0.05)	0.444 (0.02)	0.441 (0.02)	0.445 (0.02)	0.459 (0.02)	0.474 (0.03)	0.413 (0.02)	0.409 (0.02)
computer2-5	0.668 (0.07)	0.379 (0.01)	0.376 (0.01)	0.387 (0.01)	0.397 (0.02)	0.457 (0.01)	0.367 (0.01)	0.370 (0.02)
fireman	2.999 (0.44)	3.690 (0.07)	3.649 (0.06)	3.732 (0.13)	3.798 (0.18)	1.298 (0.05)	1.602 (0.08)	1.593 (0.09)
kinematics	1.797 (0.13)	1.442 (0.02)	1.454 (0.02)	1.474 (0.02)	1.474 (0.02)	1.155 (0.03)	1.175 (0.04)	1.175 (0.04)

Table 5: The mean and standard error of the zero-one error on benchmark datasets. Outperforming methods in each dataset are highlighted in boldface using the Wilcoxon signed-rank test with a significance level of 5%. The experiments were conducted 20 times.

	SSMOR	TOR	SV-Ker	SEMI1-Ker	SEMI2-Ker	SV-NN	SEMI1-NN	SEMI2-NN
bank1-5	0.608 (0.04)	0.238 (0.01)	0.238 (0.01)	0.244 (0.01)	0.246 (0.01)	0.273 (0.01)	0.255 (0.01)	0.259 (0.01)
bank2-5	0.706 (0.02)	0.646 (0.01)	0.651 (0.02)	0.641 (0.01)	0.647 (0.01)	0.602 (0.01)	0.601 (0.01)	0.605 (0.01)
census1-5	0.697 (0.02)	0.569 (0.01)	0.579 (0.01)	0.584 (0.02)	0.585 (0.01)	0.574 (0.01)	0.556 (0.01)	0.560 (0.01)
census2-5	0.689 (0.03)	0.568 (0.01)	0.572 (0.01)	0.579 (0.01)	0.576 (0.01)	0.549 (0.01)	0.541 (0.01)	0.538 (0.01)
computer1-5	0.544 (0.04)	0.386 (0.01)	0.386 (0.01)	0.386 (0.01)	0.386 (0.01)	0.416 (0.02)	0.376 (0.02)	0.383 (0.01)
computer2-5	0.537 (0.04)	0.339 (0.01)	0.339 (0.01)	0.341 (0.01)	0.344 (0.02)	0.398 (0.01)	0.343 (0.01)	0.357 (0.02)
fireman	0.863 (0.02)	0.927 (0.01)	0.926 (0.00)	0.927 (0.01)	0.926 (0.01)	0.682 (0.01)	0.718 (0.01)	0.718 (0.02)
kinematics	0.785 (0.01)	0.768 (0.01)	0.779 (0.01)	0.781 (0.01)	0.781 (0.01)	0.705 (0.01)	0.711 (0.01)	0.711 (0.01)

Table 6: The mean and standard error of the mean squared error on benchmark datasets. Outperforming methods in each dataset are highlighted in boldface using the Wilcoxon signed-rank test with a significance level of 5%. The experiments were conducted 20 times.

	SSMOR	TOR	SV-Ker	SEMI1-Ker	SEMI2-Ker	SV-NN	SEMI1-NN	SEMI2-NN
bank1-5	1.719 (0.47)	0.364 (0.02)	0.363 (0.02)	0.415 (0.08)	0.432 (0.06)	0.317 (0.02)	0.313 (0.02)	0.306 (0.02)
bank2-5	2.448 (0.23)	1.282 (0.05)	1.256 (0.04)	1.595 (0.07)	1.380 (0.08)	1.464 (0.07)	1.481 (0.07)	1.476 (0.07)
census1-5	2.524 (0.44)	1.233 (0.08)	1.183 (0.06)	1.188 (0.07)	1.222 (0.06)	1.264 (0.08)	1.053 (0.07)	1.068 (0.06)
census2-5	2.493 (0.51)	1.205 (0.07)	1.172 (0.06)	1.140 (0.04)	1.198 (0.06)	1.189 (0.06)	1.100 (0.05)	1.079 (0.04)
computer1-5	0.958 (0.09)	0.550 (0.03)	0.538 (0.03)	0.542 (0.03)	0.569 (0.05)	0.618 (0.05)	0.536 (0.03)	0.515 (0.03)
computer2-5	0.944 (0.16)	0.457 (0.02)	0.453 (0.02)	0.456 (0.02)	0.484 (0.02)	0.604 (0.05)	0.502 (0.05)	0.480 (0.03)
fireman	15.753 (4.34)	6.472 (0.24)	6.519 (0.24)	7.553 (1.32)	7.226 (1.01)	2.766 (0.19)	3.319 (0.33)	3.168 (0.28)
kinematics	5.633 (0.74)	2.878 (0.14)	2.751 (0.08)	2.994 (0.14)	2.994 (0.14)	2.597 (0.11)	2.362 (0.13)	2.362 (0.13)

Table 7: Dataset statistics of small-scale experiments.

	dim.	# of unlabeled data	# of test data	class prior
abalone	10	871	1253	[0.2, 0.6, 0.2]
bank1-5	8	1851	2000	[0.2, 0.6, 0.2]
bank2-5	32	1851	2000	[0.2, 0.6, 0.2]
census1-5	8	2000	2000	[0.2, 0.6, 0.2]
census2-5	16	2000	2000	[0.2, 0.6, 0.2]
computer1-5	12	1851	2000	[0.2, 0.6, 0.2]
computer2-5	21	1851	2000	[0.2, 0.6, 0.2]
fireman-example	9	2000	2000	[0.5, 0.25, 0.25]
kinematics	7	1851	2000	[0.38, 0.38, 0.25]
lev	4	204	300	[0.09, 0.68, 0.22]
swd	10	204	300	[0.38, 0.4, 0.22]
toy	2	57	90	[0.12, 0.78, 0.1]

Table 8: Dataset statistics of large-scale experiments.

	# of classes	class prior
bank1-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
bank2-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
census1-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
census2-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
computer1-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
computer2-5	5	[0.2, 0.2, 0.2, 0.2, 0.2]
fireman-example	16	around 0.0625 for all classes
kinematics	8	around 0.125 for all classes

Equations

$$g(\bm{x};f,\bm{\theta})$$

$$\mathcal{R}(g)\coloneqq\mathbb{E}_{X,Y}\left[\mathcal{L}(g(X),Y)\right],$$

$$\mathcal{L}(g(\bm{x}),y)=|y-g(\bm{x})|.$$

$$\mathcal{R}(g)=\mathbb{E}_{X,Y}\left[|Y-g(X)|\right].$$

$$|y-g(\bm{x};f,\bm{\theta})|=\sum_{i=1}^{y-1}\mathbbm{1}[\alpha_{i}(\bm{x})\geq 0]+\sum_{i=y}^{K-1}\mathbbm{1}[\alpha_{i}(\bm{x})<0],$$

$$\bm{\alpha}:\mathcal{X}\to\mathbb{R}^{K-1},\quad\alpha_{i}(\bm{x})\coloneqq\theta_{i}-f(\bm{x})\quad(i=1,\dots,K-1),$$

$$\mathcal{L}(g(\bm{x}),y)=\mathbbm{1}[y\neq g(\bm{x})].$$

$$\mathbbm{1}[y\neq g(\bm{x})]=\begin{cases}\mathbbm{1}[\alpha_{1}(\bm{x})<0]&(y=1),\\\mathbbm{1}[\alpha_{y-1}(\bm{x})\geq 0]+\mathbbm{1}[\alpha_{y}(\bm{x})<0]&(y\in\{2,\dots,K-1\}),\\\mathbbm{1}[\alpha_{K-1}(\bm{x})\geq 0]&(y=K).\end{cases}$$

$$\mathcal{L}(g(\bm{x}),y)=(y-g(\bm{x}))^{2}.$$

$$\psi_{\mathrm{AT}}(\bm{\alpha}(\bm{x}),y)\coloneqq\sum_{i=1}^{y-1}\ell(-\alpha_{i}(\bm{x}))+\sum_{i=y}^{K-1}\ell(\alpha_{i}(\bm{x})).$$

$$\psi_{\mathrm{IT}}(\bm{\alpha}(\bm{x}),y)\coloneqq\begin{cases}\ell(\alpha_{1}(\bm{x}))&(y=1),\\\ell(-\alpha_{y-1}(\bm{x}))+\ell(\alpha_{y}(\bm{x}))&(y\in\{2,\dots,K-1\}),\\\ell(-\alpha_{K-1}(\bm{x}))&(y=K).\end{cases}$$

$$\psi_{\mathrm{LS}}(\bm{\alpha}(\bm{x}),y)\coloneqq\left(y+\alpha_{1}(\bm{x})-\frac{3}{2}\right)^{2}.$$

$$\mathcal{S}(g)\coloneqq\mathbb{E}_{X,Y}\left[\psi(\bm{\alpha}(X),Y)\right].$$

$$\mathcal{X}_{\mathrm{L}}\coloneqq\{(\bm{x}_{j}^{\mathrm{L}},y_{j})\}_{j=1}^{n_{\mathrm{L}}}\overset{\text{i.i.d.}}{\sim}p(X,Y),\quad\mathcal{X}_{\mathrm{U}}\coloneqq\{\bm{x}_{j}^{\mathrm{U}}\}_{j=1}^{n_{\mathrm{U}}}\overset{\text{i.i.d.}}{\sim}p(X)=\sum_{y=1}^{K}\pi_{y}\,p(X\mid Y=y),$$

$$\mathcal{S}_{\mathrm{LU}}^{\backslash k}(g)=\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}\mathbb{E}_{X\mid Y=y}\left[\psi(\bm{\alpha}(X),y)\right]+\mathbb{E}_{\mathrm{U}}\left[\psi(\bm{\alpha}(X),k)\right]-\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}\mathbb{E}_{X\mid Y=y}\left[\psi(\bm{\alpha}(X),k)\right],$$

$$\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)\coloneqq\underbrace{\sum_{y\in\mathcal{Y}^{\backslash k}}\frac{\pi_{y}}{n_{y}}\sum_{j=1}^{n_{y}}\psi(\bm{\alpha}(\bm{x}_{j}^{y}),y)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{L}}^{(1)}(g)}+\underbrace{\frac{1}{n_{\mathrm{U}}}\sum_{j=1}^{n_{\mathrm{U}}}\psi(\bm{\alpha}(\bm{x}_{j}^{\mathrm{U}}),k)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{U}}(g)}-\underbrace{\sum_{y\in\mathcal{Y}^{\backslash k}}\frac{\pi_{y}}{n_{y}}\sum_{j=1}^{n_{y}}\psi(\bm{\alpha}(\bm{x}_{j}^{y}),k)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)}.$$

$$\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)\coloneqq\gamma\,\mathcal{S}_{\mathrm{LU}}^{\backslash k}(g)+(1-\gamma)\,\mathcal{S}(g).$$

$$\mathfrak{R}(\mathcal{F};n,p)\coloneqq\mathbb{E}_{Z_{1},\dots,Z_{n}}\mathbb{E}_{\bm{\sigma}}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(Z_{i})\right].$$

$$\Theta$$

$$\mathcal{G}$$

$$\mathcal{A}_{i}$$

$$g^{*}\coloneqq\operatorname*{arg\,min}_{f\in\mathcal{F},\bm{\theta}\in\Theta}\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)\left(=\operatorname*{arg\,min}_{f\in\mathcal{F},\bm{\theta}\in\Theta}\mathcal{S}(g)\right)$$

$$\widehat{g}^{\backslash k}\coloneqq\operatorname*{arg\,min}_{f\in\mathcal{F},\bm{\theta}\in\Theta}\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)\left(\not\equiv\operatorname*{arg\,min}_{f\in\mathcal{F},\bm{\theta}\in\Theta}\widehat{\mathcal{S}}_{\psi}(g)\right)$$

$$\begin{aligned}\mathcal{H}_{\mathrm{AT}}&\coloneqq\{(\bm{x},y)\mapsto\psi_{\mathrm{AT}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F},\bm{\theta}\in\Theta\},\\\mathcal{H}_{\mathrm{IT}}&\coloneqq\{(\bm{x},y)\mapsto\psi_{\mathrm{IT}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F},\bm{\theta}\in\Theta\},\\\mathcal{H}_{\mathrm{LS}}&\coloneqq\{(\bm{x},y)\mapsto\psi_{\mathrm{LS}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F}\}.\end{aligned}$$

$$\begin{aligned}\mathcal{H}_{\mathrm{AT}}^{(y)}&\coloneqq\{\bm{x}\mapsto\psi_{\mathrm{AT}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F},\bm{\theta}\in\Theta\},\\\mathcal{H}_{\mathrm{IT}}^{(y)}&\coloneqq\{\bm{x}\mapsto\psi_{\mathrm{IT}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F},\bm{\theta}\in\Theta\},\\\mathcal{H}_{\mathrm{LS}}^{(y)}&\coloneqq\{\bm{x}\mapsto\psi_{\mathrm{LS}}(\bm{\alpha}(\bm{x}),y):f\in\mathcal{F}\}.\end{aligned}$$

$$\begin{aligned}\mathfrak{R}(\mathcal{H}_{\mathrm{AT}}^{(y)};n)&\leq\rho\sum_{j=1}^{K-1}\mathfrak{R}(\mathcal{A}_{j};n),\\\mathfrak{R}(\mathcal{H}_{\mathrm{IT}}^{(y)};n)&\leq\rho\left(\mathfrak{R}(\mathcal{A}_{y-1};n)+\mathfrak{R}(\mathcal{A}_{y};n)\right),\\\mathfrak{R}(\mathcal{H}_{\mathrm{LS}}^{(y)};n)&\leq 2\left|y+\theta_{1}-3/2\right|\mathfrak{R}(\mathcal{F};n)+\mathfrak{R}(\mathrm{sq}\circ\mathcal{F};n),\end{aligned}$$

$$\begin{aligned}\mathfrak{R}(\mathcal{H}_{\mathrm{AT}};n)&\leq\rho K\sum_{j=1}^{K-1}\mathfrak{R}(\mathcal{A}_{j};n),\\\mathfrak{R}(\mathcal{H}_{\mathrm{IT}};n)&\leq\rho\sum_{y\in\mathcal{Y}}\left(\mathfrak{R}(\mathcal{A}_{y-1};n)+\mathfrak{R}(\mathcal{A}_{y};n)\right),\\\mathfrak{R}(\mathcal{H}_{\mathrm{LS}};n)&\leq 2\left(K+\left|\theta_{1}-3/2\right|\right)\mathfrak{R}(\mathcal{F};n)+\mathfrak{R}(\mathrm{sq}\circ\mathcal{F};n),\end{aligned}$$

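$$\begin{aligned}\mathcal{S}(\widehat{g}^{\backslash k})-\mathcal{S}(g^{*})&\leq 4\left[\gamma\left\{\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}\left(\mathfrak{R}_{\mathrm{UB}}^{(y)}(n_{y})+\mathfrak{R}_{\mathrm{UB}}^{(k)}(n_{y})\right)+\mathfrak{R}_{\mathrm{UB}}^{(k)}(n_{\mathrm{U}})\right\}+(1-\gamma)\mathfrak{R}_{\mathrm{UB}}(n_{\mathrm{L}})\right]\\&\quad+2\sqrt{2}\,C_{\psi}\left[\gamma\left\{2\sum_{y\in\mathcal{Y}^{\backslash k}}\frac{\pi_{y}}{\sqrt{n_{y}}}+\frac{1}{\sqrt{n_{\mathrm{U}}}}\right\}+(1-\gamma)\frac{1}{\sqrt{n_{\mathrm{L}}}}\right]\sqrt{\ln\frac{2(K+1)}{\delta}},\end{aligned}$$

$$\mathfrak{R}_{\mathrm{UB}}^{(y)}(n)$$

$$\mathfrak{R}_{\mathrm{UB}}(n)$$

$$\mathcal{F}=\{f(\bm{x})=\bm{w}^{\top}\bm{\phi}(\bm{x}):\|\bm{w}\|\leq C_{\bm{w}},\ \|\bm{\phi}(\bm{x})\|_{2}\leq C_{\bm{\phi}}\}$$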


Full text

Semi-Supervised Ordinal Regression

Based on Empirical Risk Minimization

Taira Tsuchiya (at Kyoto University)

The University of Tokyo

RIKEN AIP

Nontawat Charoenphakdee

The University of Tokyo

RIKEN AIP

Issei Sato

The University of Tokyo

Masashi Sugiyama

RIKEN AIP

The University of Tokyo

Abstract

Ordinal regression is aimed at predicting an ordinal class label. In this paper, we consider its semi-supervised formulation, in which we have unlabeled data along with ordinal-labeled data to train an ordinal regressor. There are several metrics to evaluate the performance of ordinal regression, such as the mean absolute error, mean zero-one error, and mean squared error. However, the existing studies do not take the evaluation metric into account, have a restriction on the model choice, and have no theoretical guarantee. To overcome these problems, we propose a novel generic framework for semi-supervised ordinal regression based on the empirical risk minimization principle that is applicable to optimizing all of the metrics mentioned above. Besides, our framework has flexible choices of models, surrogate losses, and optimization algorithms without the common geometric assumption on unlabeled data such as the cluster assumption or manifold assumption. We further provide an estimation error bound to show that our risk estimator is consistent. Finally, we conduct experiments to show the usefulness of our framework.

1 Introduction

The goal of ordinal regression is to learn an ordinal regressor that predicts a label from a discrete and ordered label set (Chu and Keerthi, 2005; Chu and Ghahramani, 2005; Gutierrez et al., 2016; Pedregosa et al., 2017). For example, consider the problem of predicting the diabetes stage of a patient. The progression of diabetes consists of five stages ranging from mild to severe conditions (Weir and Bonner-Weir, 2004). The stage number is discrete, and the possible stages are totally ordered. Ordinal regression has been employed in a variety of fields such as medical research (Bender and Grouven, 1997, 1998), credit rating (Kim and Ahn, 2012; Dikkers and Rothkrantz, 2005), and social sciences (Fullerton and Xu, 2012). In the real world, the labeling process can be costly and time-consuming. Hence, it is desirable to make use of unlabeled data to improve the prediction accuracy. Although semi-supervised ordinal regression would greatly benefit practical applications, it has not been extensively explored yet (Hua-fu, 2010; Liu et al., 2011; Seah et al., 2012; Srijith et al., 2013; Pérez-Ortiz et al., 2016).

The main challenge of semi-supervised learning is how to incorporate unlabeled data to improve the prediction performance (Chapelle et al., 2006). To make use of unlabeled data, many assumptions on unlabeled data have been proposed, but they are often violated in real-world problems. A representative one is the manifold assumption (Belkin et al., 2006), in which input data is assumed to be distributed on a lower-dimensional manifold. The manifold assumption has also been utilized in the context of ordinal regression (Liu et al., 2011; Pérez-Ortiz et al., 2016). Another popular assumption is the cluster assumption (Seeger, 2000), in which examples belonging to the same cluster in the input space should have the same label. Seah et al. (2012) proposed a transductive ordinal regression method based on the cluster assumption. However, such a cluster assumption may not be satisfied in practice. From a different perspective, a semi-supervised ordinal regression method based on Gaussian processes has been proposed (Srijith et al., 2013), which relies on the low-density separation assumption (Chapelle et al., 2006). However, the low-density separation assumption may rarely be satisfied in reality, and Gaussian processes have high computational costs and also restrict the choice of models. It is known that the performance of semi-supervised learning methods based on such a geometric assumption is significantly degraded if the assumption on unlabeled data is violated (Li and Zhou, 2015; Sakai et al., 2017). Moreover, none of the existing semi-supervised ordinal regression methods mentioned above has a theoretical guarantee or takes the target evaluation metric into account, and all of them limit the choice of models.

Our goal is to establish a novel generic framework that allows flexible choices of models, evaluation metrics, and optimization algorithms and does not require any geometric assumption on unlabeled data. Building on the recent theoretical advances in ordinal regression by Pedregosa et al. (2017) and in semi-supervised binary classification based on positive-unlabeled classification by Sakai et al. (2017), we propose an empirical risk minimization (ERM) based framework that achieves this ambitious goal. To theoretically justify our framework, we show that our proposed unbiased risk estimator is consistent by establishing an estimation error bound. In addition, we analyze variance reduction to illustrate that unlabeled data can help improve the performance. Finally, we demonstrate the usefulness of our framework by conducting experiments for the three evaluation metrics using benchmark datasets.

2 Preliminaries

In this section, we formulate the standard supervised ordinal regression problem. Besides, we discuss its loss function, which is specific to the ordinal regression problem and largely different from the standard loss functions for regression and classification.

2.1 Supervised Ordinal Regression

We formulate the risk used in supervised ordinal regression. Let $\mathcal{X}\subset\mathbb{R}^{d}$ be a $d$-dimensional input space and $\mathcal{Y}=\{1,\ldots,K\}$ be an ordered label space, where $K$ is the number of classes. We assume that labeled data $(\bm{x},y)\in\mathcal{X}\times\mathcal{Y}$ is drawn from the joint probability distribution with density $p$. In ordinal regression, a function $g:\mathcal{X}\rightarrow\mathcal{Y}$, which is called a prediction function, is used to predict an ordinal label $y$ from input $\bm{x}$ as

$$g(\bm{x};f,\bm{\theta})=1+\sum_{i=1}^{K-1}\mathbbm{1}\left[f(\bm{x})>\theta_{i}\right],\tag{1}$$

where $\bm{\theta}=\left[\theta_{1},\ldots,\theta_{K-1}\right]^{\top}\in\mathbb{R}^{K-1}$ is the vector of threshold parameters, which should be ordered as $\theta_{1}\leq\theta_{2}\leq\cdots\leq\theta_{K-1}$, $f:\mathcal{X}\to\mathbb{R}$ is the decision function, and $\mathbbm{1}{\left[{\cdot}\right]}$ is the indicator function, which is $1$ if the inner condition holds and $0$ otherwise (Pedregosa et al., 2017). Note that $f$ is a function parameterized by some parameters that are unrelated to $\bm{\theta}$, and that $f$ and $\bm{\theta}$ are together the parameters of $g$. The goal of ordinal regression is to obtain a prediction function that minimizes the task risk defined as

$$\mathcal{R}(g)\coloneqq\mathbb{E}_{X,Y}\left[\mathcal{L}(g(X),Y)\right],\tag{2}$$

where $\mathbb{E}_{X,Y}[\cdot]$ denotes the expectation over the joint distribution over $\mathcal{X}\times\mathcal{Y}$ , and $\mathcal{L}:\mathcal{Y}\times\mathcal{Y}\rightarrow\mathbb{R}_{\geq 0}$ is called a task loss function. The task loss function $\mathcal{L}(g(\bm{x}),y)$ measures the performance of a prediction function based on a difference between the output of the prediction function $g(\bm{x})\in\mathcal{Y}$ and the true class label $y\in\mathcal{Y}$ .

There are a variety of task losses used in the ordinal regression problem (Pedregosa et al., 2017). One of the most often used task losses is the absolute loss, which is

$$\mathcal{L}(g(\bm{x}),y)=|y-g(\bm{x})|.\tag{3}$$

The task risk (2) with the absolute loss is expressed as

$$\mathcal{R}(g)=\mathbb{E}_{X,Y}\left[|Y-g(X)|\right].\tag{4}$$

With the property of the prediction function, the following proposition holds for the absolute loss.

Proposition 1 (Pedregosa et al. (2017)).

The absolute loss is equivalently expressed as

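$$|y-g(\bm{x};f,\bm{\theta})|=\sum_{i=1}^{y-1}\mathbbm{1}\left[\alpha_{i}(\bm{x})\geq 0\right]+\sum_{i=y}^{K-1}\mathbbm{1}\left[\alpha_{i}(\bm{x})<0\right],\tag{5}$$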

where

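$$\bm{\alpha}:\mathcal{X}\to\mathbb{R}^{K-1},\quad\alpha_{i}(\bm{x})\coloneqq\theta_{i}-f(\bm{x})\quad(i=1,\dots,K-1),\tag{6}$$

and $\alpha_{i}(\bm{x})$ denotes the $i$-th element of $\bm{\alpha}(\bm{x})$.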
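The identity in Proposition 1 can be checked numerically. The following is a minimal sketch, assuming the threshold prediction rule "one plus the number of thresholds strictly below $f(\bm{x})$" (equivalently, the number of negative margins $\alpha_{i}(\bm{x})=\theta_{i}-f(\bm{x})$, plus one). The function names and the threshold values are illustrative, not taken from the paper.

```python
def alpha(f_x, thetas):
    """Margins alpha_i(x) = theta_i - f(x) for i = 1, ..., K-1."""
    return [t - f_x for t in thetas]


def predict(f_x, thetas):
    """Assumed threshold rule: 1 + number of negative margins."""
    return 1 + sum(1 for a in alpha(f_x, thetas) if a < 0)


def absolute_loss_via_margins(f_x, thetas, y):
    """Right-hand side of Proposition 1 for a label y in {1, ..., K}."""
    a = alpha(f_x, thetas)
    # i = 1, ..., y-1: thresholds that should lie below f(x) but do not
    below = sum(1 for i in range(1, y) if a[i - 1] >= 0)
    # i = y, ..., K-1: thresholds that should lie above f(x) but do not
    above = sum(1 for i in range(y, len(thetas) + 1) if a[i - 1] < 0)
    return below + above


# K = 4 classes with ordered thresholds (illustrative values)
thetas = [-1.0, 0.0, 1.5]
for f_x in [-2.0, -1.0, -0.3, 0.0, 0.7, 1.5, 3.0]:
    for y in range(1, 5):
        assert abs(y - predict(f_x, thetas)) == absolute_loss_via_margins(f_x, thetas, y)
```

The check relies on the thresholds being ordered; with unordered thresholds the two sides can differ.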

Another task loss that has been popularly used is the zero-one loss (Chu and Keerthi, 2005), which is

$$\mathcal{L}(g(\bm{x}),y)=\mathbbm{1}\left[y\neq g(\bm{x})\right].\tag{7}$$

Using the definition of the prediction function, the following proposition holds for the zero-one loss.

Proposition 2 (Pedregosa et al. (2017)).

The zero-one loss is equivalently expressed as

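$$\mathbbm{1}\left[y\neq g(\bm{x})\right]=\begin{cases}\mathbbm{1}\left[\alpha_{1}(\bm{x})<0\right]&(y=1),\\\mathbbm{1}\left[\alpha_{y-1}(\bm{x})\geq 0\right]+\mathbbm{1}\left[\alpha_{y}(\bm{x})<0\right]&(y\in\{2,\dots,K-1\}),\\\mathbbm{1}\left[\alpha_{K-1}(\bm{x})\geq 0\right]&(y=K).\end{cases}\tag{8}$$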
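The zero-one identity in Proposition 2 only inspects the two thresholds adjacent to the true label. A small sketch of that check follows, again assuming the threshold rule $g(\bm{x})=1+\#\{i:\alpha_{i}(\bm{x})<0\}$ with $\alpha_{i}(\bm{x})=\theta_{i}-f(\bm{x})$. The helper names and numbers are hypothetical.

```python
def predict(f_x, thetas):
    """Assumed threshold rule: 1 + number of negative margins theta_i - f(x)."""
    return 1 + sum(1 for t in thetas if t - f_x < 0)


def zero_one_via_margins(f_x, thetas, y):
    """Piecewise expression of the zero-one loss using only adjacent thresholds."""
    K = len(thetas) + 1
    a = [t - f_x for t in thetas]  # a[i - 1] stores alpha_i(x)
    if y == 1:
        return int(a[0] < 0)
    if y == K:
        return int(a[K - 2] >= 0)
    return int(a[y - 2] >= 0) + int(a[y - 1] < 0)


# K = 4 classes with ordered thresholds (illustrative values)
thetas = [-1.0, 0.0, 1.5]
for f_x in [-2.0, -1.0, -0.3, 0.0, 0.7, 1.5, 3.0]:
    for y in range(1, 5):
        assert int(y != predict(f_x, thetas)) == zero_one_via_margins(f_x, thetas, y)
```

Because the thresholds are ordered, at most one of the two indicators in the middle case can be active, so the expression stays in $\{0,1\}$.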

Another frequently used task loss function is the squared loss (Pedregosa et al., 2017), which is defined as

$$\mathcal{L}(g(\bm{x}),y)=(y-g(\bm{x}))^{2}.\tag{9}$$

The above definitions of the task loss functions imply that the task losses in ordinal regression are non-continuous functions, due to the discrete nature of the prediction function. It is known that optimization over such a non-continuous function is computationally infeasible (Feldman et al., 2012; Ben-David et al., 2003). In practice, we do not directly minimize the task risk but relax the task risk to a surrogate risk, which will be explained in Section 2.2.

2.2 Surrogate Losses for Ordinal Regression

We first discuss one of the task surrogate losses to the absolute loss called the all threshold (AT) loss (Pedregosa et al., 2017). Table 1 shows the notations used throughout this paper. The AT loss can be obtained by replacing the $0\text{-}1$ loss $\mathbbm{1}{\left[{m<0}\right]}$ in (5) with a binary surrogate loss $\ell:\mathbb{R}\to\mathbb{R}_{\geq 0}$ . The main motivation to use a surrogate loss instead of the $0\text{-}1$ loss is to make optimization easier. The binary surrogate loss should be an optimization-friendly function and also satisfy the minimum requirement to be Fisher-consistent, as described in Pedregosa et al. (2017). For example, the well-known squared loss and logistic loss are valid to be employed in the AT losses in ordinal regression. Specifically, the task surrogate loss $\psi_{\mathrm{AT}}:\mathbb{R}^{K-1}\times\mathcal{Y}\rightarrow\mathbb{R}_{\geq 0}$ based on (5) can be expressed as

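$$\psi_{\mathrm{AT}}(\bm{\alpha}(\bm{x}),y)\coloneqq\sum_{i=1}^{y-1}\ell(-\alpha_{i}(\bm{x}))+\sum_{i=y}^{K-1}\ell(\alpha_{i}(\bm{x})).\tag{10}$$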

It is known that the AT loss has good empirical performance compared with other surrogate losses in most cases (Rennie, 2005). In a similar manner, we can define the task surrogate loss to the zero-one loss called the immediate threshold (IT) loss as

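$$\psi_{\mathrm{IT}}(\bm{\alpha}(\bm{x}),y)\coloneqq\begin{cases}\ell(\alpha_{1}(\bm{x}))&(y=1),\\\ell(-\alpha_{y-1}(\bm{x}))+\ell(\alpha_{y}(\bm{x}))&(y\in\{2,\dots,K-1\}),\\\ell(-\alpha_{K-1}(\bm{x}))&(y=K).\end{cases}\tag{11}$$

Defining $\alpha_{0}(\bm{x})=\alpha_{K}(\bm{x})=0$ for all $\bm{x}\in\mathcal{X}$, we may write $\psi_{\mathrm{IT}}(\bm{\alpha}(\bm{x}),y)=\ell(-\alpha_{y-1}(\bm{x}))+\ell(\alpha_{y}(\bm{x}))$ for notational simplicity. The task surrogate loss to the squared loss, called the least-squares (LS) loss, does not rely on the binary surrogate loss and is defined as

$$\psi_{\mathrm{LS}}(\bm{\alpha}(\bm{x}),y)\coloneqq\left(y+\alpha_{1}(\bm{x})-\frac{3}{2}\right)^{2}.\tag{12}$$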

In general, the task surrogate risk guided by a task surrogate loss $\psi:\mathbb{R}^{K-1}\times\mathcal{Y}\rightarrow\mathbb{R}_{\geq 0}$ can be written as

$$\mathcal{S}(g)\coloneqq\mathbb{E}_{X,Y}\left[\psi(\bm{\alpha}(X),Y)\right].\tag{13}$$

At first glance, the notation of (13) can be confusing since $g$ does not appear on the right-hand side. However, $g$ contains $f$ and $\bm{\theta}$ as parameters, and $\bm{\alpha}$ is defined by $f$ and $\bm{\theta}$. Thus, we adopt this notation in this paper. In supervised ordinal regression, we are given labeled data drawn independently from the joint density $p$. Then, the risk (4) can be naively minimized by the ERM framework. Note that the minimization is run over the decision function $f$ and the thresholds $\bm{\theta}$ for the AT and IT losses, while it is run only over $f$ for the LS loss. Note also that regularization schemes can be applied if necessary.
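The three task surrogate losses of this subsection can be sketched in a few lines. This is an illustrative sketch only: it assumes the logistic loss $\ell(z)=\log(1+e^{-z})$ as the binary surrogate (one of the valid choices named above), writes the IT loss in its piecewise form so that no boundary convention is needed, and uses hypothetical function names.

```python
import math


def logistic(z):
    """Binary surrogate loss l(z) = log(1 + exp(-z)), in a numerically stable form."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def at_loss(a, y, l=logistic):
    """All-threshold loss; a[i - 1] stores alpha_i(x), labels y run over 1..K."""
    lower = sum(l(-a[i - 1]) for i in range(1, y))           # i = 1, ..., y-1
    upper = sum(l(a[i - 1]) for i in range(y, len(a) + 1))   # i = y, ..., K-1
    return lower + upper


def it_loss(a, y, l=logistic):
    """Immediate-threshold loss: only the thresholds adjacent to y contribute."""
    K = len(a) + 1
    total = 0.0
    if y > 1:
        total += l(-a[y - 2])  # term for alpha_{y-1}
    if y < K:
        total += l(a[y - 1])   # term for alpha_y
    return total


def ls_loss(a, y):
    """Least-squares loss (y + alpha_1(x) - 3/2)^2; it uses only the first margin."""
    return (y + a[0] - 1.5) ** 2


margins = [-0.8, 0.4, 1.9]  # illustrative margins for K = 4 classes
for y in range(1, 5):
    print(y, at_loss(margins, y), it_loss(margins, y), ls_loss(margins, y))
```

With a non-negative binary surrogate, the AT loss is never smaller than the IT loss, since the IT loss keeps a subset of the AT terms. For $K=2$ the two coincide.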

In the real world, collecting labeled training data can be costly. Thus, it is preferable to incorporate unlabeled data to train an ordinal regressor. We can see that direct empirical estimation of the risk term in (4) cannot utilize unlabeled data. In this paper, we mitigate this problem by extending the ERM framework to semi-supervised ordinal regression.

3 Proposed Framework

In this section, we introduce our new formulation for semi-supervised ordinal regression.

Unbiased Risk Estimator. We first derive a risk estimator within the proposed ERM framework. Then, we theoretically investigate the behavior of our risk estimator. Here, we assume that we are given the following data:

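$$\mathcal{X}_{\mathrm{L}}\coloneqq\{(\bm{x}_{j}^{\mathrm{L}},y_{j})\}_{j=1}^{n_{\mathrm{L}}}\overset{\text{i.i.d.}}{\sim}p(X,Y),\quad\mathcal{X}_{\mathrm{U}}\coloneqq\{\bm{x}_{j}^{\mathrm{U}}\}_{j=1}^{n_{\mathrm{U}}}\overset{\text{i.i.d.}}{\sim}p(X)=\sum_{y=1}^{K}\pi_{y}\,p(X\mid Y=y),\tag{14}$$

where $n_{\mathrm{L}}$ and $n_{\mathrm{U}}$ denote the number of labeled data and unlabeled data, respectively. Let $n_{y}$ denote the number of labeled data in class $y$, i.e., $n_{\mathrm{L}}=\sum_{y=1}^{K}n_{y}$, and $\pi_{y}\coloneqq p(Y=y)$ denote the class-prior probability of the class $y$ such that $\sum_{y=1}^{K}\pi_{y}=1$.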

As discussed in Section 2.2, it is not straightforward to incorporate unlabeled data into the task surrogate risk (13). To handle this problem, we propose to find an equivalent expression of the task surrogate risk (13) so that we can obtain an unbiased risk estimator that uses both unlabeled and labeled data. Our following lemma states that we can rewrite the task surrogate risk to contain the expectations over $K-1$ classes of labeled data and the expectation over the marginal density $p(X)$ .

Lemma 3.

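For any $k\in\{1,\ldots,K\}$, the task surrogate risk (13) is equivalently expressed as

$$\mathcal{S}_{\mathrm{LU}}^{\backslash k}(g)=\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}\mathbb{E}_{X\mid Y=y}\left[\psi(\bm{\alpha}(X),y)\right]+\mathbb{E}_{\mathrm{U}}\left[\psi(\bm{\alpha}(X),k)\right]-\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}\mathbb{E}_{X\mid Y=y}\left[\psi(\bm{\alpha}(X),k)\right],\tag{15}$$

where $\mathbb{E}_{\mathrm{U}}[\cdot]$ denotes the expectation over unlabeled data, $\mathcal{Y}^{\backslash k}\coloneqq\mathcal{Y}\backslash\{k\}$, and LU stands for "Labeled-Unlabeled".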
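A short reader's sketch of why this identity is plausible, using only the mixture form of the marginal density $p(X)=\sum_{y}\pi_{y}p(X\mid Y=y)$ (the formal proof is the one in Appendix A; the steps below are an informal reconstruction, not a quotation of it):

```latex
\begin{align*}
\mathcal{S}(g)
  &= \sum_{y \in \mathcal{Y}^{\backslash k}} \pi_y\, \mathbb{E}_{X \mid Y=y}\!\left[\psi(\bm{\alpha}(X), y)\right]
     + \pi_k\, \mathbb{E}_{X \mid Y=k}\!\left[\psi(\bm{\alpha}(X), k)\right], \\
\mathbb{E}_{\mathrm{U}}\!\left[\psi(\bm{\alpha}(X), k)\right]
  &= \sum_{y=1}^{K} \pi_y\, \mathbb{E}_{X \mid Y=y}\!\left[\psi(\bm{\alpha}(X), k)\right], \\
\pi_k\, \mathbb{E}_{X \mid Y=k}\!\left[\psi(\bm{\alpha}(X), k)\right]
  &= \mathbb{E}_{\mathrm{U}}\!\left[\psi(\bm{\alpha}(X), k)\right]
     - \sum_{y \in \mathcal{Y}^{\backslash k}} \pi_y\, \mathbb{E}_{X \mid Y=y}\!\left[\psi(\bm{\alpha}(X), k)\right].
\end{align*}
```

Substituting the third line into the first removes every expectation over class $k$ and leaves only expectations over the other $K-1$ classes and over the marginal density.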

The proof of Lemma 3 is given in Appendix A. With the risk obtained from Lemma 3, we can derive the following unbiased risk estimator for the task surrogate risk (13):

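$$\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)\coloneqq\underbrace{\sum_{y\in\mathcal{Y}^{\backslash k}}\frac{\pi_{y}}{n_{y}}\sum_{j=1}^{n_{y}}\psi(\bm{\alpha}(\bm{x}_{j}^{y}),y)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{L}}^{(1)}(g)}+\underbrace{\frac{1}{n_{\mathrm{U}}}\sum_{j=1}^{n_{\mathrm{U}}}\psi(\bm{\alpha}(\bm{x}_{j}^{\mathrm{U}}),k)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{U}}(g)}-\underbrace{\sum_{y\in\mathcal{Y}^{\backslash k}}\frac{\pi_{y}}{n_{y}}\sum_{j=1}^{n_{y}}\psi(\bm{\alpha}(\bm{x}_{j}^{y}),k)}_{=:\,\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)}.\tag{16}$$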

We can interpret the risk estimator in (16) as follows. The first term on the right-hand side of (16) indicates that labeled data that are not from class $k$ should be predicted correctly. The second term indicates that unlabeled data should be predicted as $k$. The purpose of the third term is to cancel the bias from the second term by subtracting the risk of all labeled data that are not from class $k$ to be predicted as class $k$. Lemma 3 is inspired by the technique used in weakly-supervised learning, such as binary classification from positive and unlabeled data and semi-supervised binary classification, which have been shown to be effective (du Plessis et al., 2015; Sakai et al., 2017).
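The three terms of the estimator in (16) can be sketched directly. The code below is a minimal illustration under stated assumptions: inputs are represented by their margin vectors $\bm{\alpha}(\bm{x})$, the LS loss stands in for $\psi$, and the data layout (`labeled` as a dict from class to a list of margin vectors) and all numbers are hypothetical. As a sanity check, when the unlabeled set is exactly the pooled labeled inputs and $\pi_{y}=n_{y}/n_{\mathrm{L}}$, the estimator collapses to the ordinary supervised empirical risk for every choice of $k$.

```python
import math


def psi(a, y):
    """LS surrogate loss (y + alpha_1 - 3/2)^2, used here only as an example."""
    return (y + a[0] - 1.5) ** 2


def lu_risk(labeled, unlabeled, priors, k, psi):
    """Empirical LU risk: labeled data from classes other than k, plus unlabeled data."""
    s_l1 = 0.0  # labeled data (not class k) scored against their own labels
    s_l2 = 0.0  # the same labeled data scored against label k (bias correction)
    for y, xs in labeled.items():
        if y == k:
            continue  # class-k labeled data are not used by this estimator
        weight = priors[y] / len(xs)
        s_l1 += weight * sum(psi(a, y) for a in xs)
        s_l2 += weight * sum(psi(a, k) for a in xs)
    s_u = sum(psi(a, k) for a in unlabeled) / len(unlabeled)
    return s_l1 + s_u - s_l2


# Toy data for K = 3: each input is represented by its margin vector (illustrative)
labeled = {1: [[0.5, 1.5], [0.1, 1.1]], 2: [[-0.4, 0.6]], 3: [[-1.2, -0.2]]}
n_l = sum(len(xs) for xs in labeled.values())
priors = {y: len(xs) / n_l for y, xs in labeled.items()}
pooled = [a for xs in labeled.values() for a in xs]
supervised = sum(psi(a, y) for y, xs in labeled.items() for a in xs) / n_l

for k in labeled:
    assert math.isclose(lu_risk(labeled, pooled, priors, k, psi), supervised)
```

In practice the unlabeled set is a separate, typically much larger sample, and the class priors are either known or estimated; the degenerate setup above is only meant to make the cancellation visible.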

Although we can obtain a risk estimator that can utilize unlabeled data, we still cannot make full use of all given data since our risk estimator in (16) ignores labeled data from class $k$. To mitigate this problem, inspired by the work on semi-supervised learning for binary classification (Sakai et al., 2017), we propose to combine the risk estimator of supervised ordinal regression with our risk estimator in (16) by a convex combination as follows.

Theorem 4.

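For any $k\in\mathcal{Y}$ and $\gamma\in[0,1]$, the task surrogate risk (13) is equivalently expressed as

$$\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)\coloneqq\gamma\,\mathcal{S}_{\mathrm{LU}}^{\backslash k}(g)+(1-\gamma)\,\mathcal{S}(g).\tag{17}$$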

Theorem 4 directly follows from Lemma 3. It is worth noting that our risk (17) is equivalent to the ordinary surrogate risk in (13). Therefore, the theory of the Fisher-consistency of surrogate losses and excess risk bounds in the ordinary ordinal regression (Pedregosa et al., 2017) are directly applicable to our framework, which will be discussed in the next section.
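The equivalence stated by Theorem 4 can be verified exactly on a toy problem where all expectations are finite sums. The sketch below assumes a hypothetical joint distribution on three input points and three classes, the identity map as decision function, and the LS loss as $\psi$; none of these choices come from the paper.

```python
import math

# joint[(x, y)] = p(x, y) on a toy finite input space (illustrative numbers, sum to 1)
joint = {(-1.0, 1): 0.30, (0.2, 1): 0.10, (0.2, 2): 0.25, (1.0, 2): 0.05, (1.0, 3): 0.30}
thetas = [-0.5, 0.6]


def f(x):
    """Hypothetical decision function."""
    return x


def psi(x, y):
    """LS surrogate loss (y + alpha_1(x) - 3/2)^2 with alpha_1(x) = theta_1 - f(x)."""
    return (y + thetas[0] - f(x) - 1.5) ** 2


def supervised_risk():
    """Population surrogate risk: expectation of psi over the joint distribution."""
    return sum(p * psi(x, y) for (x, y), p in joint.items())


def lu_risk(k):
    """Population LU risk with class k removed from the labeled expectations."""
    # pi_y * E_{X|Y=y}[.] summed over y != k equals the sum of p(x, y) * (.) over y != k
    first = sum(p * psi(x, y) for (x, y), p in joint.items() if y != k)
    unlabeled = sum(p * psi(x, k) for (x, _), p in joint.items())  # expectation over p(x)
    third = sum(p * psi(x, k) for (x, y), p in joint.items() if y != k)
    return first + unlabeled - third


def semi_risk(k, gamma):
    """Convex combination of the LU risk and the supervised risk."""
    return gamma * lu_risk(k) + (1.0 - gamma) * supervised_risk()


for k in (1, 2, 3):
    for gamma in (0.0, 0.3, 1.0):
        assert math.isclose(semi_risk(k, gamma), supervised_risk())
```

The population risks agree for every $k$ and $\gamma$; the choice of these two quantities only matters once expectations are replaced by finite-sample averages.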

Roles of Unlabeled Data. The proposed risk does not use geometric information of unlabeled data; thus, it is applicable even when a specific geometric assumption does not hold. Then, an important question to be answered is, "how does unlabeled data help us obtain a good ordinal regressor?" In the training phase, by introducing unlabeled data, we can expect that the variance of the empirical risk is decreased, resulting in more accurate risk estimation. Also, in the validation phase, the variance of the validation risk estimator is reduced, helping us to select good hyper-parameters.

The removed class $k$ can be chosen arbitrarily, as shown in Lemma 3, and the performance can depend on the choice of $k$. We will investigate strategies to select the class $k$ in Section 5.1 based on the theory discussed in the next section.

4 Theoretical Analysis

This section establishes the Fisher-consistency and estimation error bounds to elucidate that our risk estimator $\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)$ has theoretical guarantees. Besides, we provide a theoretical analysis of variance reduction, which shows that the variance of the empirical semi-supervised risk can be smaller than that of the empirical supervised risk.

4.1 Fisher-Consistency

To ensure that the minimizer of our task surrogate risk $\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}$ can give the optimal solution to the task risk (2), we give the following proposition.

Proposition 5 (Fisher-Consistency).

Let the task surrogate loss $\psi$ be the AT loss, IT loss, or LS loss. Assume for the AT and IT losses that the binary surrogate loss $\ell$ is convex, differentiable at 0, and $\ell^{\prime}(0)<0$. Then, every minimizer $g$ of $\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}$ attains the Bayes optimal risk $\inf\{\mathcal{R}(g):\text{ all measurable functions }g\}$.

The proof of Proposition 5 is given in Appendix B. Although we avoid going into the details, Proposition 5 also holds for the cumulative link and least absolute deviation losses; we can prove Fisher-consistency for them in a similar manner (see Pedregosa et al. (2017) for the definitions of these losses). With Proposition 5, we can clarify that the following estimation error bound is applicable to a wide range of task surrogate losses.

4.2 Estimation Error Bound

To ensure that our empirical risk estimator $\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)$ is consistent to the risk $\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)$ and the empirical risk minimizer approaches the true risk minimizer, we establish estimation error bounds here. Let $\mathcal{F}\subset\mathbb{R}^{\mathcal{X}}$ be a class of specified decision functions. Then, we will consider a distribution-dependent complexity measure of functions, called the expected Rademacher complexity (Bartlett and Mendelson, 2002).

Definition 6 (Expected Rademacher complexity).

Let $n$ be a positive integer, $Z_{1},\dots,Z_{n}$ be i.i.d. random variables drawn from a probability distribution with density $p$ over a set $\mathcal{Z}$ , $\mathcal{F}\subset\mathbb{R}^{\mathcal{Z}}$ be a family of functions from $\mathcal{Z}$ to $\mathbb{R}$ , and $\bm{\sigma}=(\sigma_{1},\dots,\sigma_{n})$ be random variables, which take $+1$ and $-1$ with equal probabilities. Then the expected Rademacher complexity of $\mathcal{F}$ is defined as

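$$\mathfrak{R}(\mathcal{F};n)\coloneqq\mathbb{E}_{Z_{1},\dots,Z_{n}}\mathbb{E}_{\bm{\sigma}}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(Z_{i})\right].$$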

Intuitively, the Rademacher complexity quantifies how much our decision function class $\mathcal{F}$ can correlate to the random noise. Thus, a large Rademacher complexity indicates that a decision function is highly flexible to fit the noise. This complexity term is an important tool to derive an estimation error bound (more details about the measures of the complexity of a hypothesis class can be found in Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014); Mohri et al. (2018)).
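To make this intuition concrete, the following minimal sketch estimates the empirical Rademacher complexity of a norm-bounded linear class on a fixed sample by Monte Carlo. It is an illustration only: the function name `empirical_rademacher_linear`, the toy Gaussian data, and the choice of the Euclidean norm are assumptions, and the empirical (fixed-sample) quantity is used in place of the expected one, which would additionally average over samples.

```python
import numpy as np

def empirical_rademacher_linear(X, c_w, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> w^T x : ||w||_2 <= c_w} on the fixed sample X (shape n x d)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Each row of sigma is one draw of n independent +/-1 signs.
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    # For a norm-bounded linear class the supremum has a closed form:
    # sup_w (1/n) sum_i sigma_i w^T x_i = c_w * ||(1/n) sum_i sigma_i x_i||_2.
    sup_values = c_w * np.linalg.norm(sigma @ X / n, axis=1)
    return sup_values.mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))               # toy sample (assumption)
c_w = 1.0
c_phi = np.linalg.norm(X, axis=1).max()     # bound on ||x|| over this sample
estimate = empirical_rademacher_linear(X, c_w)
upper_bound = c_w * c_phi / np.sqrt(X.shape[0])
print(f"estimate={estimate:.4f}, C_w*C_phi/sqrt(n)={upper_bound:.4f}")
```

The estimate stays below the familiar $C_{\bm{w}}C_{\bm{\phi}}/\sqrt{n}$ bound for linear models, and it shrinks as $n$ grows, which is the behavior the estimation error bounds rely on.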

To formally define the Rademacher complexity in the context of ordinal regression, we introduce the following definitions.

[TABLE]

Note that $\mathfrak{R}(\mathcal{A}_{0})=\mathfrak{R}(\mathcal{A}_{K})=0$ from the extended definition of $\bm{\alpha}$ . Let

[TABLE]

be the true risk minimizer, and

[TABLE]

be the empirical risk minimizer. Note that $g^{*},\widehat{g}^{\backslash k}\in\mathcal{G}$ are pairs of some $f\in\mathcal{F}$ and $\bm{\theta}\in\Theta$ , respectively. We also define the hypothesis classes guided by each task surrogate loss as

[TABLE]

In a similar manner, we define the hypothesis classes guided by each task surrogate loss for fixed label $y\in\mathcal{Y}$ as

[TABLE]

We use $\mathcal{H}$ (resp. $\mathcal{H}^{(y)}$) instead of $\mathcal{H}_{\mathrm{AT}},\mathcal{H}_{\mathrm{IT}}$, and $\mathcal{H}_{\mathrm{LS}}$ (resp. $\mathcal{H}_{\mathrm{AT}}^{(y)},\mathcal{H}_{\mathrm{IT}}^{(y)}$, and $\mathcal{H}_{\mathrm{LS}}^{(y)}$) when a statement holds regardless of the task surrogate loss used. At first glance, some definitions may look redundant, but they are needed to investigate properties specific to each of the AT, IT, and LS losses.

To the best of our knowledge, the Rademacher complexity in the context of the ordinal regression problem has never been investigated. The next two theorems establish upper bounds for the guided hypothesis classes.

Theorem 7.

Fix $y\in\mathcal{Y}$ . Let $n\in\mathbb{N}$ and assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz. Then, the expected Rademacher complexities of $\mathcal{H}_{\mathrm{AT}}^{(y)},\mathcal{H}_{\mathrm{IT}}^{(y)}$ , and $\mathcal{H}_{\mathrm{LS}}^{(y)}$ are bounded as

[TABLE]

where $\mathrm{sq}:\mathrm{Im}\,f\ni z\mapsto z^{2}\in\mathbb{R}$ .

Theorem 8 (Upper bounds of Rademacher complexities of task surrogate losses).

Let $n\in\mathbb{N}$ and assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz. Then, the expected Rademacher complexities of $\mathcal{H}_{\mathrm{AT}},\mathcal{H}_{\mathrm{IT}}$ , and $\mathcal{H}_{\mathrm{LS}}$ are bounded as

[TABLE]

where $\mathrm{sq}:\mathrm{Im}\,f\ni z\mapsto z^{2}\in\mathbb{R}$ .

Remark.

The bound on $\mathfrak{R}(\mathcal{H}_{\mathrm{LS}};n)$ is given using $\mathfrak{R}(\mathcal{F};n)$ instead of $\mathfrak{R}(\mathcal{A}_{j};n)$ . This is because, in the LS loss, the threshold is fixed.

The proofs of Theorems 7 and 8 are given in Appendix E. Both theorems state that the Rademacher complexities with the guided hypothesis classes can be bounded by the Rademacher complexity of the underlying decision functions class $\mathcal{F}$ , and show how $\mathfrak{R}(\mathcal{H};n_{y})$ of each task surrogate loss depends on the problem-dependent parameters. Then, the next theorem establishes estimation error bounds with a class of general decision functions based on Theorems 7 and 8.

Theorem 9 (Estimation error bounds with general decision functions).

Assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz, and that there exists a constant $C_{\psi}>0$ such that $\psi(y,\bm{\alpha})\leq C_{\psi}$ for any $y\in\mathcal{Y},\ \bm{\alpha}\in\mathbb{R}^{K-1}$ . Then, for any $k\in\mathcal{Y}$ and $\delta\in(0,1]$ , with probability at least $1-\delta$ ,

[TABLE]

where $\mathfrak{R}_{\mathrm{UB}}^{(y)}(n)$ for $y\in\mathcal{Y}$ and $\mathfrak{R}_{\mathrm{UB}}(n)$ are defined as

[TABLE]

The proof of Theorem 9 is given in Appendix D. Note that the Rademacher complexity of the underlying decision function class $\mathcal{F}$ can be bounded for many decision function classes, as discussed in Niu et al. (2016); Bao et al. (2018); Mohri et al. (2018), and a bound will be given for linear-in-parameter models (Lemma 21 in Appendix E).

Next, we restrict the decision function class $\mathcal{F}$ to a class of linear-in-parameter models and examine in more detail the behavior of the Rademacher complexities of the hypothesis classes guided by each task surrogate loss. The class of linear-in-parameter models is defined as

[TABLE]

for positive constants $C_{\bm{w}}$ and $C_{\bm{\phi}}$ , where $\bm{w}\in\mathbb{R}^{b}$ is a parameter and $\bm{\phi}:\mathbb{R}^{d+1}\rightarrow\mathbb{R}^{b}$ is a basis function. We assume that the bias parameter is included in $\bm{w}$ . We can prove the following lemma using the theorem on the Rademacher complexity bound for ordinal regression (Theorem 8) and a well-known upper bound for the Rademacher complexity with linear-in-parameter models (Lemma 21 in Appendix E).

Lemma 10.

Let $n\in\mathbb{N}$ and assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz. Then, the expected Rademacher complexities of $\mathcal{H}_{\mathrm{AT}}^{(y)}$ , $\mathcal{H}_{\mathrm{IT}}^{(y)}$ , $\mathcal{H}_{\mathrm{LS}}^{(y)}$ , $\mathcal{H}_{\mathrm{AT}}$ , $\mathcal{H}_{\mathrm{IT}}$ , and $\mathcal{H}_{\mathrm{LS}}$ with the class of linear-in-parameter models are bounded as

[TABLE]

The proof of Lemma 10 is given in Appendix E. Combining Theorem 9 and Lemma 10 establishes the following estimation error bounds for the linear-in-parameter models.

Corollary 11 (Estimation error bounds with linear-in-parameter models).

Assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz and that there exists a constant $C_{\psi}>0$ such that $\psi(y,\bm{\alpha})\leq C_{\psi}$ for any $y\in\mathcal{Y},\ \bm{\alpha}\in\mathbb{R}^{K-1}$ . Then, for any $k\in\mathcal{Y}$ and $\delta\in(0,1]$ , with probability at least $1-\delta$ ,

[TABLE]

where $C^{(y)}$ for $y\in\mathcal{Y}$ and $C$ are constants defined as

[TABLE]

Corollary 11 shows that our proposed risk estimator is consistent, i.e., $\mathcal{S}(\widehat{g}^{\backslash k})\rightarrow\mathcal{S}(g^{*})$ as $n_{y}\rightarrow\infty\;(y=1,\ldots,K)$ and $n_{\mathrm{U}}\rightarrow\infty$ . The convergence rate is

[TABLE]

where $\mathcal{O}_{p}$ denotes the order in probability. This order is the optimal parametric rate for empirical risk minimization without any additional assumption (Mendelson, 2008).

4.3 Variance Reduction

Here, we conduct an analysis of variance reduction to emphasize why incorporating unlabeled data to our risk estimator improves the performance. The following theorem indicates that the variance of the empirical semi-supervised risk can be smaller than that of the empirical supervised risk.

Theorem 12 (Variance reduction).

Fix any $g\in\mathcal{G}$ and assume that $\gamma\in[0,1]$ satisfies

[TABLE]

Then,

[TABLE]

The proof of Theorem 12 is given in Appendix F. This theorem implies that if we select $\gamma$ properly, the variance of the empirical semi-supervised risk is strictly smaller than that of the empirical supervised risk. Since the empirical semi-supervised risk is unbiased and has a smaller variance than the standard supervised risk, the former risk estimator is expected to be more accurate and stable, as we will see experimentally in Section 6.

5 Practical Implementation

With Theorem 4, we can obtain a risk estimator that can fully use both labeled and unlabeled data. It is also straightforward to see that our risk estimator based on Theorem 4 is unbiased. However, to use our risk estimator effectively in practice, one important question is: how can we decide the class $k$ to calculate $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$ ? We discuss strategies to handle this problem theoretically.

5.1 Strategies to Remove One Class for $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$

If the performance is highly sensitive to the choice of $k$, it could be cumbersome to tune this hyperparameter as the number of classes $K$ increases. Here, we consider two strategies for selecting the class $k$ to be removed in $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$, both motivated by the properties of the estimator.

The first strategy is based on finite sample estimation error. More specifically, when we approximate the expectation term by a limited number of samples, the variance of the estimator can be large. Then a naive strategy would be to remove the class that contains the smallest number of labeled data as

$$k=\operatorname*{argmin}_{y\in\mathcal{Y}}\ n_{y}.\qquad(30)$$

The other strategy is based on the estimation error bound. As discussed in Theorem 9, the convergence rate of the estimation error for $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$ is $\mathcal{O}_{p}(\sum_{y\in\mathcal{Y}^{\backslash k}}{\pi_{y}}/{\sqrt{n_{y}}})$ under the assumptions that we have enough unlabeled data $(n_{\mathrm{U}}\rightarrow\infty)$ . This implies the following proposition which provides a strategy to decide the removed class $k$ .

Proposition 13.

Assume that labeled data are obtained under the assumption in (14), and the above assumptions are satisfied. Then,

[TABLE]

gives the lowest upper bound of the estimation error as $n_{\mathrm{L}}\rightarrow\infty$ .

It is interesting to see that the above two strategies give completely opposite solutions as the removed class. We will experimentally compare these strategies in Section 6, where we find that both strategies performed similarly although they are completely different, indicating that the choice of $k$ may not be critical to the performance.
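The two selection rules can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, constants in the bound are ignored, and the second rule simply minimizes the rate term $\sum_{y\in\mathcal{Y}^{\backslash k}}\pi_{y}/\sqrt{n_{y}}$ over $k$ for given class priors and per-class sample sizes.

```python
import numpy as np

def removed_class_smallest_sample(n_per_class):
    """Strategy 1: remove the class with the fewest labeled points (1-indexed)."""
    return int(np.argmin(n_per_class)) + 1

def removed_class_by_bound(n_per_class, priors):
    """Strategy 2: remove the class k minimizing sum_{y != k} pi_y / sqrt(n_y)."""
    n = np.asarray(n_per_class, dtype=float)
    pi = np.asarray(priors, dtype=float)
    terms = pi / np.sqrt(n)
    # Leaving class k out drops its term, so the best k has the largest term.
    leave_one_out = terms.sum() - terms
    return int(np.argmin(leave_one_out)) + 1

# Toy example (assumption): labeled sizes proportional to the class priors.
n_per_class = [10, 30, 60]
priors = [0.1, 0.3, 0.6]
k1 = removed_class_smallest_sample(n_per_class)
k2 = removed_class_by_bound(n_per_class, priors)
print(k1, k2)  # prints "1 3"
```

When the labeled sizes are proportional to the priors, each term equals $\sqrt{\pi_{y}/n_{\mathrm{L}}}$, so the second rule removes the class with the largest prior while the first removes the class with the smallest one.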

5.2 Order Constraints

In ordinal regression problems, the threshold parameters should be ordered, i.e., $\theta_{1}\leq\theta_{2}\leq\cdots\leq\theta_{K-1}$ (Pedregosa et al., 2017). Here, we introduce a simple trick to constrain the threshold parameters $\bm{\theta}$ by adding the term

[TABLE]

where $\mu\geq 0$ is a regularization parameter for the order constraints. We show that this simple trick works well in the experiments. In fact, even without any regularization term on $\bm{\theta}$, we empirically observed that the values of the threshold parameters $\bm{\theta}$ are usually ordered. This observation suggests that the order constraints are not difficult to satisfy when optimizing $\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}$. This threshold regularization scheme is based on the idea of the log-barrier method (Boyd and Vandenberghe, 2004), in which we add a high cost $-\log(-c(\bm{\theta}))$ to the objective function for an inequality constraint $c(\bm{\theta})\leq 0$ with $c:\Theta\rightarrow\mathbb{R}$. However, using only $-\log(\theta_{i+1}-\theta_{i})$ may lead to the following two problems. First, the algorithm may make $\theta_{i+1}-\theta_{i}$ larger than necessary, although only $\theta_{i+1}-\theta_{i}\geq 0$ is required. Second, the function $-\log(\theta_{i+1}-\theta_{i})$ becomes negative if $\theta_{i+1}-\theta_{i}>1$. This may allow the overall risk to become arbitrarily negative by making $\theta_{i+1}-\theta_{i}$ larger, and may cause instability in the optimization procedure when combined with the empirical version of (15). We therefore further introduce $\max\{0,\cdot\}$ in (32) to mitigate these problems.
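A minimal sketch of such a penalty is given below, assuming the regularizer takes the form $\mu\sum_{i}\max\{0,-\log(\theta_{i+1}-\theta_{i})\}$ as the description suggests. The function name `order_penalty` and the small floor `eps`, which keeps the logarithm defined for tied or misordered thresholds, are assumptions of this illustration.

```python
import numpy as np

def order_penalty(theta, mu=10.0, eps=1e-12):
    """Clipped log-barrier penalty on consecutive threshold gaps."""
    theta = np.asarray(theta, dtype=float)
    gaps = np.diff(theta)                      # theta_{i+1} - theta_i
    # Floor the gaps so the log is defined when thresholds are tied or
    # out of order (the eps floor is an assumption of this sketch).
    barrier = -np.log(np.maximum(gaps, eps))
    # max{0, .} removes the negative part that appears once a gap exceeds 1.
    return mu * np.sum(np.maximum(0.0, barrier))

print(order_penalty([0.0, 0.5, 2.0]))   # only the 0.5 gap is penalized
print(order_penalty([0.0, 2.0, 4.0]))   # gaps above 1 contribute nothing
print(order_penalty([1.0, 0.0]))        # misordered thresholds are penalized heavily
```

Because of the clipping, widening a gap beyond 1 neither lowers the objective further nor pushes it toward negative values, which addresses the two problems described above.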

Here, we investigate the objective function when the linear-in-parameter model $f(\bm{x})=\bm{w}^{\top}\bm{\phi}(\bm{x})$ is employed as $f$ . Our next theorem states a sufficient condition to guarantee that, for a certain task surrogate loss function, the optimization problem is convex with respect to both the model and order constraints parameters.

Theorem 14.

Let $C_{\ell}$ be a positive constant. If the AT or LS losses are adopted as the task surrogate loss, and the binary surrogate loss $\ell(z)$ is convex and satisfies

[TABLE]

then the objective function $\widehat{J}_{\ell}(\bm{w},\bm{\theta})\coloneqq\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)+\Omega(\bm{\theta})$ is convex with respect to $\bm{w}$ and $\bm{\theta}$ .

The proof of Theorem 14 is given in Appendix G. The condition (33) is known as the linear-odd condition (Patrini et al., 2016). Examples of binary surrogate losses are shown in Figure 1 and Table 2. In our experiments, we used the logistic loss as the binary surrogate loss. Note that the objective function with the IT loss is not always convex.
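As a quick numerical illustration, one can check whether a binary surrogate loss is linear-odd, here taken to mean $\ell(z)-\ell(-z)=cz$ for a constant $c$ (an assumed reading of the condition). The three losses below and their scalings are chosen for illustration and are not necessarily those listed in Table 2.

```python
import numpy as np

def logistic(z):
    return np.logaddexp(0.0, -z)      # log(1 + exp(-z)), numerically stable

def squared(z):
    return 0.25 * (1.0 - z) ** 2      # scaled so that the slope c equals -1

def hinge(z):
    return np.maximum(0.0, 1.0 - z)

def is_linear_odd(loss, c=-1.0, grid=np.linspace(-5.0, 5.0, 201)):
    """Check loss(z) - loss(-z) == c * z on a grid of margins."""
    return bool(np.allclose(loss(grid) - loss(-grid), c * grid))

for name, loss in [("logistic", logistic), ("squared", squared), ("hinge", hinge)]:
    print(name, is_linear_odd(loss))
```

The logistic loss satisfies the identity exactly, since $\log(1+e^{-z})-\log(1+e^{z})=-z$, whereas the hinge loss does not because its difference is piecewise linear with a kink.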

5.3 Non-Negative Risk Estimator

We discuss a risk modification technique called a non-negative risk estimator. This technique was originally proposed in Kiryo et al. (2017) for classification from positive and unlabeled data based on an unbiased risk estimator. We can see that the sum of the second and third terms of our empirical risk estimator (16), $\widehat{\mathcal{S}}_{\mathrm{U}}(g)-\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)$ , can be negative while the corresponding expected risk $\mathcal{S}_{\mathrm{U}}(g)-\mathcal{S}_{\mathrm{L}}^{(2)}(g)=\pi_{k}\mathbb{E}_{X|Y=k}\left[\psi(\bm{\alpha}(X),k)\right]$ is always non-negative. This observation suggests that the model can excessively reduce the empirical risk by maximizing $\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)$ . To prevent this problem, following Kiryo et al. (2017), we modify the empirical risk estimator (16) as

[TABLE]

This approach was later generalized in Lu et al. (2020) in the context of unlabeled-unlabeled classification. Specifically, they considered a LeakyReLU-like regularization function instead of the max operator used in (34). They empirically showed that this regularization technique makes trained models generalize better. Following this observation, we propose to modify our empirical risk estimator (16) as

[TABLE]

where the Generalized Leaky ReLU function $\mathrm{GLReLU}:\mathbb{R}\rightarrow\mathbb{R}$ is defined as $\mathrm{GLReLU}(x)\coloneqq x\mathbbm{1}{\left[{x\geq 0}\right]}+\lambda x\mathbbm{1}{\left[{x<0}\right]}$ for $\lambda\leq 0$ . We will employ this regularization technique in experiments.
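The correction can be sketched in a few lines. The function `gl_relu` follows the definition above; the helper `corrected_risk` is hypothetical and assumes that the modified estimator keeps the first term of (16) and applies the correction to the difference $\widehat{\mathcal{S}}_{\mathrm{U}}(g)-\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)$.

```python
import numpy as np

def gl_relu(x, lam=-0.2):
    """Generalized Leaky ReLU: identity for x >= 0, slope lam (<= 0) for x < 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0.0, x, lam * x)

def corrected_risk(risk_first, risk_u, risk_l2, lam=-0.2):
    """Hypothetical helper: first term plus the corrected (U - L2) difference."""
    return risk_first + gl_relu(risk_u - risk_l2, lam)

print(gl_relu(2.0))                       # non-negative input passes through
print(gl_relu(-1.0))                      # negative input is mapped to lam * x >= 0
print(gl_relu(-1.0, lam=0.0))             # lam = 0 recovers max{0, x}
print(corrected_risk(1.0, 0.3, 0.5))      # negative difference is flipped and shrunk
```

With $\lambda<0$, a negative difference is turned into a positive penalty, so the optimizer is discouraged from driving $\widehat{\mathcal{S}}_{\mathrm{L}}^{(2)}(g)$ up instead of merely being stopped at zero.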

6 Experiments

In this section, we begin by numerically investigating whether unlabeled data help reduce the variance of the risk estimator, as theoretically discussed in Section 4. We then experimentally investigate which strategy for removing one class in $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$ is better, as discussed in Section 5.1. Finally, we present the experimental comparison results of semi-supervised ordinal regression on benchmark datasets.

Common Setup. As evaluation metrics, we adopted the mean absolute error, mean zero-one error, and mean squared error. Note that these metrics correspond to the task surrogate risks that adopt the AT, IT, and LS losses as the task surrogate loss, respectively. We conducted experiments with the AT, IT, and LS losses as the task surrogate loss and the logistic loss as the binary surrogate loss. For validation of the hyperparameters, we used the hold-out method by splitting the training data set with the ratio of $2:1$. For both models, we fixed the hyperparameters $\gamma$ and $\mu$ to $0.5$ and $10$, respectively. Note that we empirically observed that the performance is insensitive to $\mu$ if $\mu$ is large enough. We ran the experiments $20$ times to calculate the mean and standard error of the performances. Also, we used the non-negative risk estimator described in Section 5.3 with $\lambda=-0.2$ to prevent over-fitting. We used Chainer (Tokui et al., 2019) to implement our models. We obtained datasets from a survey paper on ordinal regression (Gutierrez et al., 2016) and the website on ordinal regression benchmark data (https://www.gagolewski.com/resources/data/ordinal-regression/). The datasets are described in detail in the following.
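For reference, the three evaluation metrics can be computed from integer class labels as in the following small sketch; the function name `ordinal_errors` and the toy labels are illustrative assumptions.

```python
import numpy as np

def ordinal_errors(y_true, y_pred):
    """Mean absolute, mean zero-one, and mean squared error for ordinal labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    diff = y_pred - y_true
    return {
        "mae": float(np.mean(np.abs(diff))),   # mean absolute error
        "mze": float(np.mean(diff != 0)),      # mean zero-one error
        "mse": float(np.mean(diff ** 2)),      # mean squared error
    }

errors = ordinal_errors([1, 2, 3, 3], [1, 3, 3, 1])
print(errors)  # {'mae': 0.75, 'mze': 0.5, 'mse': 1.25}
```

The example shows how the metrics weigh mistakes differently: predicting class 1 for a class-3 point counts once for the zero-one error, twice for the absolute error, and four times for the squared error.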

Baselines. We compared our proposed methods (SEMI) against the following baselines.

Supervised Ordinal Regression (SV): Supervised ordinal regression here is based on empirical risk minimization of the task surrogate risk, which is described in Pedregosa et al. (2017) and Section 2. Recall that a different task surrogate loss is used for each objective function. In other words, the AT, IT, and LS losses are used when the mean absolute error, mean zero-one error, and mean squared error are used as evaluation metrics, respectively.

Transductive Ordinal Regression (TOR): Transductive Ordinal Regression (Seah et al., 2012) is a method based on pseudo-labeling, and TOR assumes that the data satisfy the cluster assumption. TOR tries to minimize the objective function by repeatedly minimizing the sum of risks on labeled data and pseudo-labeled data. Note that the classifier obtained by the training can be used as an inductive classifier. We noticed that TOR is essentially a pseudo-labeling method based on supervised empirical risk minimization with the AT loss, although this is not mentioned in the paper, implying that we can extend it to use the IT and LS losses instead of the AT loss. Thus, we used the IT and LS losses in our experiments when adopting the mean zero-one error and mean squared error, respectively.

Semi-Supervised Manifold Ordinal Regression (SSMOR): Semi-Supervised Manifold Ordinal Regression (Liu et al., 2011) is a method based on the manifold assumption. SSMOR first minimizes the objective function, which is the sum of a smoothness constraint (nearby points should have the same label) and a fitting constraint (fitting on the given labels). After optimizing the objective function, SSMOR tries to obtain labels for new data points by calculating thresholds in a heuristic manner. However, we found that the thresholds calculated via this approach may not be ordered, and the resulting performance is poor. Thus, we first sorted the data points based on the prediction score and then allocated labels according to the class prior.
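The label allocation used for SSMOR (sort by prediction score, then assign labels according to the class prior) can be sketched as follows. This is one plausible reading of that step, not the authors' code: the function name `allocate_labels_by_prior` and the rounding of the cumulative priors to integer cut points are assumptions.

```python
import numpy as np

def allocate_labels_by_prior(scores, priors):
    """Assign labels 1..K so that the lowest-scored points get label 1, and the
    share of each label matches the class priors (up to rounding)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    order = np.argsort(scores, kind="stable")
    ranks = np.empty(n, dtype=int)
    ranks[order] = np.arange(n)                  # rank 0 = smallest score
    # Cut points in rank space, e.g. priors (0.2, 0.3, 0.5), n = 10 -> 2, 5, 10.
    bounds = np.round(np.cumsum(priors) * n).astype(int)
    labels = 1 + np.searchsorted(bounds, ranks, side="right")
    return np.minimum(labels, len(priors))       # guard against rounding overflow

scores = [9, 0, 5, 3, 7, 1, 8, 2, 6, 4]
labels = allocate_labels_by_prior(scores, [0.2, 0.3, 0.5])
print(labels.tolist())  # [3, 1, 3, 2, 3, 1, 3, 2, 3, 2]
```

By construction the implied thresholds are ordered, which avoids the unordered thresholds observed with the original heuristic.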

6.1 Variance Reduction and Performance Comparison on Small Datasets

Here, we empirically investigate the effect of the variance reduction discussed in Section 4, and compare the empirical performance of the proposed method on small datasets. The number of classes was fixed to $K=3$ by merging some classes into one class for each dataset. We sub-sampled labeled data with $n_{\mathrm{L}}=20$, while the number of unlabeled data depends on the dataset size. Full dataset statistics are given in Appendix H.1. We used a linear-in-input model $f(\bm{x})=\bm{w}^{\top}\bm{x}$. For training our proposed method, we trained the model for $1000$ epochs using Adam with $\alpha=5.0\times 10^{-2}$ (full batch size). The candidates of the weight decay parameter were $\{10^{-6},10^{-4},10^{-2}\}$.

Variance Reduction. Here, before going to the performance comparison, we investigate the effect of variance reduction by using unlabeled data when estimating the training risk. We adopted the randomly initialized ordinal regressor $g_{\mathrm{rand}}$, which is the linear-in-input model, to compute the variance of the empirical supervised risk $\operatorname*{Var}[\widehat{\mathcal{S}}(g_{\mathrm{rand}})]$ and the empirical semi-supervised risk $\operatorname*{Var}[\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g_{\mathrm{rand}})]$. We selected the removed class $k$ using the strategies described in Section 5.1. We denote by SEMI1 our proposed semi-supervised method in which the removed class is chosen by the strategy based on the finite sample approximation (30), and by SEMI2 the variant that uses the strategy based on the estimation error bound (31). Then, we calculated the ratio between the variance of the empirical semi-supervised risk and that of the supervised risk

[TABLE]

Figure 2 shows the ratio between the variance of the empirical semi-supervised risk and supervised risk, which is evaluated on different $\gamma\in\{0.0,0.1,0.2,\ldots,1.0\}$ . Additional results of experiments are given in Appendix H.2. It can be observed that the ratios were less than $1$ in most cases. This indicates that the variance of our empirical semi-supervised risk is smaller than that of the empirical supervised risk, as suggested by Theorem 12.
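A toy version of this variance-ratio computation is sketched below on synthetic one-dimensional data. Several ingredients are assumptions made for illustration and are not taken from the experiments: the Gaussian class-conditionals, the fixed regressor $f(x)=x$ with thresholds $(-1,1)$, the AT loss written with $\alpha_{i}=\theta_{i}-f(x)$ and the logistic loss, the prior-weighted form of the supervised estimator, and the convex-combination form $\gamma\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}+(1-\gamma)\widehat{\mathcal{S}}$ of the semi-supervised estimator.

```python
import numpy as np

PRIORS = np.array([0.2, 0.3, 0.5])      # class priors pi_y (assumed known)
MEANS = np.array([-2.0, 0.0, 2.0])      # toy class-conditional means
THETA = np.array([-1.0, 1.0])           # fixed thresholds of g
N_PER_CLASS = np.array([4, 6, 10])      # n_L = 20 labeled points
N_UNLABELED = 1000

def at_loss(x, y):
    """AT loss with logistic binary loss for f(x) = x and label y in {1..K}."""
    alpha = THETA[None, :] - x[:, None]            # alpha_i = theta_i - f(x)
    idx = np.arange(1, len(THETA) + 1)[None, :]
    signed = np.where(idx < y, -alpha, alpha)      # l(-alpha_i) if i < y, else l(alpha_i)
    return np.logaddexp(0.0, -signed).sum(axis=1)

def one_trial(rng, k, gammas):
    """Draw one labeled/unlabeled sample and evaluate both risk estimators."""
    xs = [rng.normal(MEANS[y], 1.0, size=N_PER_CLASS[y]) for y in range(3)]
    comps = rng.choice(3, size=N_UNLABELED, p=PRIORS)
    xu = rng.normal(MEANS[comps], 1.0)
    sup = sum(PRIORS[y] * at_loss(xs[y], y + 1).mean() for y in range(3))
    lu = at_loss(xu, k).mean() + sum(
        PRIORS[y] * (at_loss(xs[y], y + 1) - at_loss(xs[y], k)).mean()
        for y in range(3) if y + 1 != k)
    return sup, gammas * lu + (1.0 - gammas) * sup

rng = np.random.default_rng(0)
gammas = np.linspace(0.0, 1.0, 11)
results = [one_trial(rng, k=3, gammas=gammas) for _ in range(2000)]
sup = np.array([r[0] for r in results])
semi = np.array([r[1] for r in results])           # shape (trials, len(gammas))
ratios = semi.var(axis=0) / sup.var()
for g, r in zip(gammas, ratios):
    print(f"gamma={g:.1f}  variance ratio={r:.3f}")
```

Both estimators have the same mean in this setup, so only their spread differs. Whether the ratio falls below $1$ depends on $g$, $k$, and $\gamma$, in line with the condition on $\gamma$ in Theorem 12, so this toy configuration should not be read as reproducing Figure 2.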

Performance Comparison. Table 3 shows the performances of the proposed framework against the supervised method. We can see that there is no clear difference between the simple strategy (SEMI1) and the strategy based on the estimation error bound (SEMI2). As for the performance comparison, we can see that the proposed methods (SEMI1, SEMI2) perform better than the standard supervised learning (SV). It is worth noting that our framework performs well in all task losses, indicating the broad applicability of our proposed framework.

The Effect of the Number of Unlabeled Data. Here, we investigate the effect of the number of unlabeled data on the performance improvement over the supervised empirical risk minimization approach. We first trained the ordinal regressors with the linear-in-input model for different numbers of unlabeled data. Then, we calculated the ratio of the error of the semi-supervised method to that of the supervised method. Note again that different evaluation metrics are used for the different task surrogate losses.

Figure 3 shows the ratio of the error of the semi-supervised method to that of the supervised method as a function of the number of unlabeled data. Additional results of experiments are given in Appendix H.2. It can be observed that SEMI1 and SEMI2 outperform the supervised method. However, it is also observed that the performance gain saturates at a certain point depending on the dataset, i.e., adding more unlabeled data may no longer improve the performance of either SEMI1 or SEMI2. Note that this is also a common phenomenon observed in the literature of semi-supervised learning (Sakai et al., 2017; Oliver et al., 2018; Sakai et al., 2018). Nevertheless, improving how unlabeled data are utilized remains challenging future work, for example, by developing a new regularization technique for the unbiased risk estimator approach in the context of semi-supervised learning.

6.2 Performance Comparison on Large Datasets

Here, we compare the empirical performance of the proposed framework on large datasets. We sub-sampled labeled data with $n_{\mathrm{L}}=500$ and unlabeled data with $n_{\mathrm{U}}=2000$, and the test set size was fixed to $2000$. The statistics of the datasets used in the experiments are given in Appendix H. We used linear-in-parameter models with Gaussian kernel $f(\bm{x})=\sum_{i=1}^{n_{\mathrm{L}}}w_{i}\exp(-\left\|\bm{x}-\bm{x}_{i}\right\|^{2}/\left(2\sigma^{2}\right))$, and neural networks with one hidden layer (of size $256$) and ReLU as an activation function. For the Gaussian kernel, the bandwidth candidates were $\{0.5,1.0,2.0\}\cdot\textrm{median}(\left\|\bm{x}_{i}-\bm{x}_{j}\right\|_{i,j=1}^{n_{\mathrm{L}}})$ where $\{\bm{x}_{i}\}_{i=1}^{n_{\mathrm{L}}}=\mathcal{X}_{\mathrm{L}}$. For training our proposed method, we trained the model for $1000$ epochs using Adam with $\alpha=5.0\times 10^{-3}$ (full batch size). The weight decay parameter was fixed to $10^{-4}$ for both models.
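The kernel model and the median-based bandwidth candidates can be sketched as follows. This is a minimal illustration on random toy inputs: the helper names are hypothetical, and taking the median over distinct pairs $i<j$ (rather than all pairs including $i=j$) is an assumption.

```python
import numpy as np

def pairwise_sq_dists(A, B):
    """Squared Euclidean distances between rows of A (n x d) and B (m x d)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.maximum(sq, 0.0)          # guard against tiny negative round-off

def median_bandwidth(X, scale=1.0):
    """Bandwidth candidate: scale times the median pairwise distance."""
    d = np.sqrt(pairwise_sq_dists(X, X))
    iu = np.triu_indices_from(d, k=1)   # distinct pairs i < j (an assumption)
    return scale * np.median(d[iu])

def gaussian_features(X, centers, sigma):
    """Feature matrix with entries exp(-||x - x_i||^2 / (2 sigma^2))."""
    return np.exp(-pairwise_sq_dists(X, centers) / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(50, 3))                    # toy stand-in for X_L
sigmas = [median_bandwidth(X_labeled, s) for s in (0.5, 1.0, 2.0)]
Phi = gaussian_features(X_labeled, X_labeled, sigmas[1])
print(Phi.shape, [round(s, 3) for s in sigmas])
```

With these features the decision function is $f(\bm{x})=\bm{w}^{\top}\bm{\phi}(\bm{x})$, i.e., `Phi @ w` on the training inputs, and the same `gaussian_features` call with unlabeled or test inputs as the first argument evaluates the model elsewhere.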

Performance Comparison. Tables 4 to 6 show the performances of the proposed methods against the baseline methods. We can see that the proposed methods (SEMI1-NN and SEMI2-NN) perform better than the baseline methods on most benchmark datasets. Specifically, we can observe that the method based on the manifold assumption (SSMOR) performs poorly, while the performance of the method based on the cluster assumption (TOR) is somewhat comparable with that of the proposed methods. This is because the datasets used in the experiments do not satisfy the manifold assumption, as SSMOR was originally proposed in the context of image ranking, while TOR does not heavily rely on the geometric assumption. As in the case of the small datasets, our framework performs well in all task losses, indicating the broad applicability of our proposed framework. However, unlike with neural networks, when a linear-in-parameter model with Gaussian kernel is used (SEMI1-Ker and SEMI2-Ker), we found that our method failed to improve the performance over the supervised method. This could be because the risk modification in (35) might not be suitable for a kernel model, since it was originally designed for use with neural networks (Lu et al., 2020). We hypothesize that the poor performance of the kernel method when using (35) can be attributed to the fact that the modification makes the risk formulation no longer convex, which causes difficulty in optimization. Note that the supervised baseline does not use the risk modification in (35). Hence, the supervised baseline has a convex formulation and is therefore easier to train. We note that in the case of neural networks, the optimization problem is non-convex regardless of whether we use a risk modification or not. Thus, the modification does not affect the convexity of the problem. To improve the performance of the kernel model with our proposed method, it is important future work to explore a risk modification that retains the convexity of the formulation.

7 Conclusions

We presented a novel framework to incorporate unlabeled data in ordinal regression based on empirical risk minimization. We proposed an unbiased risk estimator that is applicable to all well-known task losses, such as the absolute loss, squared loss, and zero-one loss. We also elucidated the property of the proposed unbiased risk estimator through the analysis of the estimation error bound to guarantee that the proposed risk estimator is consistent. Experimental results showed that our proposed framework could effectively make use of unlabeled data, resulting in better scores compared to the standard supervised learning method in three task losses in terms of the mean absolute error, mean zero-one error, and mean squared error. In future work, we plan to consider a new regularization technique for the unbiased risk estimator approach in the context of semi-supervised learning so that the method can make fuller use of the information in unlabeled data.

Acknowledgements

The authors would like to thank Han Bao, Yusuke Konno, and Tomoya Sakai for helpful discussions, and Kento Nozawa, Ikko Yamane, and Soma Yokoi for maintaining servers for our experiments. TT was supported by JST AIP Challenge Program, UTokyo Toyota-Dwango AI Scholarship, and Softbank AI Scholarship. NC was supported by MEXT Scholarship, JST AIP Challenge Program, and Google PhD Fellowship Program. IS was supported by JST CREST Grant Number JPMJCR17A1, and MS was supported by JST CREST JPMJCR18A2.

Appendix A Proof of Lemma 3

Proof.

We can rewrite the task surrogate risk as follows:

[TABLE]

By expanding the marginal distribution, we have

[TABLE]

Then, we can express $\pi_{k}\mathbb{E}_{X|Y=k}\left[\psi(\bm{\alpha}(X),k)\right]$ in terms of the expectation of unlabeled data and the class-conditional expectations of all classes except class $k$ as

[TABLE]

Replacing the last term of the right-hand side of (A) with the right-hand side of (38), we complete the proof. ∎
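The rewriting in this proof can also be checked numerically. In the sketch below, the loss `psi` is a stand-in chosen only for illustration (the identity does not depend on the form of $\psi$), the function names are hypothetical, and the unlabeled sample is taken to be the pooled labeled sample with priors $\pi_{y}=n_{y}/n$, in which case the empirical versions of both sides coincide exactly.

```python
import numpy as np

def psi(x, y):
    """Stand-in loss for illustration only: squared gap between score and label."""
    return (x - y) ** 2

def supervised_risk(xs_by_class, priors):
    """sum_y pi_y * mean over class y of psi(x, y)."""
    return sum(p * psi(x, y).mean()
               for y, (x, p) in enumerate(zip(xs_by_class, priors), start=1))

def lu_risk(xs_by_class, priors, x_unlabeled, k):
    """Risk with class k removed: labeled terms for y != k plus an unlabeled term."""
    labeled = sum(p * (psi(x, y) - psi(x, k)).mean()
                  for y, (x, p) in enumerate(zip(xs_by_class, priors), start=1)
                  if y != k)
    return labeled + psi(x_unlabeled, k).mean()

rng = np.random.default_rng(0)
sizes = [20, 30, 50]
priors = [s / sum(sizes) for s in sizes]
xs = [rng.normal(loc=y, size=n) for y, n in enumerate(sizes, start=1)]
x_u = np.concatenate(xs)          # unlabeled sample = pooled labeled sample

for k in (1, 2, 3):
    print(k, supervised_risk(xs, priors), lu_risk(xs, priors, x_u, k))
```

For every choice of the removed class $k$ the two values agree up to floating-point error, mirroring the fact that the choice of $k$ does not change the population risk.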

Appendix B Proof of Proposition 5

Proof.

From Theorem 4, we have $\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)=\mathcal{S}(g)$ . Also it is known that $\mathcal{S}(g)$ is Fisher-consistent to $\mathcal{R}(g)$ , if we adopt the all threshold, cumulative link, least absolute deviation, immediate threshold, or least squares as the task surrogate loss $\psi$ (Pedregosa et al., 2017). By combining all of the results mentioned above, the proof is completed. ∎

Appendix C Technical Lemmas

The following lemma is used when deriving the estimation error bounds in Appendix D.

Lemma 15 (McDiarmid’s inequality (Theorem D.8 in Mohri et al. (2018))).

Let $X_{1},\dots,X_{n}$ be a set of independent random variables taking values in $\mathcal{X}$ and assume that there exists $c_{1},\dots,c_{n}>0$ such that $f:\mathcal{X}^{n}\rightarrow\mathbb{R}$ satisfies

[TABLE]

for all $i\in\{1,\dots,n\}$ and any $x_{1},\dots,x_{n},x^{\prime}_{i}\in\mathcal{X}$ . Then,

[TABLE]

The following lemma is used when bounding the Rademacher complexities of hypothesis classes in Appendix E.

Lemma 16 (Talagrand’s lemma (Lemma 5.7 in Mohri et al. (2018))).

Let $\Phi:\mathbb{R}\rightarrow\mathbb{R}$ be $\rho$ -Lipschitz function. Then, for any set $\mathcal{H}\subset\mathbb{R}^{\mathcal{X}}$ ,

[TABLE]

Appendix D Analysis of Estimation Error Bounds (Proof of Theorem 9)

Before proving Theorem 9, we define and recall notations and give some lemmas. Defining

[TABLE]

we have

[TABLE]

where we recall that

[TABLE]

In a similar manner, we define the empirical version of each risk. Defining

[TABLE]

we have

[TABLE]

where we recall that

[TABLE]

First, we prove three lemmas in the following.

Lemma 17.

Assume that there exists a constant $C_{\psi}>0$ such that $\psi(y,\bm{\alpha})\leq C_{\psi}$ for any $y\in\mathcal{Y},\ \bm{\alpha}\in\mathbb{R}^{K-1}$. Then, for any $\delta>0$, with probability at least $1-\frac{K-1}{K+1}\delta$,

[TABLE]

Proof.

Recalling that

[TABLE]

we have

[TABLE]

where $\widehat{\mathcal{S}}_{\mathrm{L},y}^{m}(g)$ is the slight modification of $\widehat{\mathcal{S}}_{\mathrm{L},y}(g)$ where $\bm{x}^{y}_{m}$ is changed to $\bm{x}^{y}_{m^{\prime}}$ . Thus, by McDiarmid’s inequality (Lemma 15 in Appendix C), we have

[TABLE]

where $S_{y}|Y=y$ denotes the distribution over $\mathcal{X}^{n_{y}}$ when conditioned as $Y=y$ . Therefore, with probability at least $1-\frac{\delta}{2(K+1)}$ , it holds that

[TABLE]

The first term of the right-hand side of the above inequality can be bounded as

[TABLE]

where (44) and (46) follow from the subadditivity of the supremum, (45) follows from the fact that the distributions $S_{y}|Y=y$ and $S^{\prime}_{y}|Y=y$ are the same, and in the last inequality we use the fact that $-\sigma_{j}$ and $\sigma_{j}$ follow the same distribution. Repeating the same argument and combining it with (42), de Morgan's laws, and the union bound, with probability at least $1-\frac{\delta}{K+1}$, we have

[TABLE]

Therefore, Lemma 17 holds as a consequence of the inequality

[TABLE]

de Morgan’s Laws, and the union bound. ∎

Lemma 18.

Assume that there exists a constant $C_{\psi}>0$ such that $\psi(y,\bm{\alpha})\leq C_{\psi}$ for any $y\in\mathcal{Y},\ \bm{\alpha}\in\mathbb{R}^{K-1}$ . Then, for any $\delta>0$ , with probability at least $1-\frac{\delta}{K+1}$ ,

[TABLE]

Proof.

This lemma can be proven similarly to Lemma 17. ∎

Lemma 19.

Assume that there exists a constant $C_{\psi}>0$ such that $\psi(y,\bm{\alpha})\leq C_{\psi}$ for any $y\in\mathcal{Y},\ \bm{\alpha}\in\mathbb{R}^{K-1}$ . Then, for any $\delta>0$ , with probability at least $1-\frac{\delta}{K+1}$ ,

[TABLE]

Proof.

This lemma can be proven similarly to Lemma 17. ∎

Proof of Theorem 9.

Let us bound the estimation error $\mathcal{S}(\widehat{g}^{\backslash k})-\mathcal{S}(g^{*})$ . Note that the equality $\mathcal{S}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)=\mathcal{S}(g)$ holds for any $g\in\mathcal{G}$ . Then, we have

[TABLE]

where (49) follows since $\widehat{g}^{\backslash k}$ is the minimizer of $\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}$ , and the last inequality follows from the subadditivity of the supremum. Hence, combining Eq. (50) and Lemmas 17 to 19, with probability at least $1-\delta$ , we have

[TABLE]

Combining the above inequality with Theorems 7 and 8, we complete the proof of Theorem 9. ∎
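The first step of the proof is the standard argument that the estimation error of an empirical risk minimizer is at most twice the uniform deviation between the empirical and true risks. The sketch below checks this on a toy problem. It is an illustration under assumptions: a finite population stands in for the true distribution, the hypothesis class is a small grid of one-dimensional threshold rules with the zero-one loss, and all names are hypothetical.

```python
import random


def risk(threshold, data):
    """Zero-one risk of the rule 'predict True if x >= threshold'."""
    return sum((x >= threshold) != y for x, y in data) / len(data)


rng = random.Random(0)

# Finite population standing in for the true distribution (assumption).
population = []
for _ in range(2000):
    x = rng.uniform(0.0, 1.0)
    y = (x >= 0.6) != (rng.random() < 0.1)  # clean label flipped with prob. 0.1
    population.append((x, y))

sample = rng.sample(population, 40)         # observed training sample
hypotheses = [i / 20 for i in range(21)]    # finite class of thresholds

true_risk = {t: risk(t, population) for t in hypotheses}
emp_risk = {t: risk(t, sample) for t in hypotheses}

g_hat = min(hypotheses, key=emp_risk.get)    # empirical risk minimizer
g_star = min(hypotheses, key=true_risk.get)  # best hypothesis in the class

estimation_error = true_risk[g_hat] - true_risk[g_star]
sup_deviation = max(abs(emp_risk[t] - true_risk[t]) for t in hypotheses)
```

Here `estimation_error <= 2 * sup_deviation` holds for every draw of the sample, which is why the remainder of the proof only needs to control the uniform deviations handled by Lemmas 17 to 19.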

Appendix E Upper Bounds for Rademacher Complexities in Ordinal Regression (Proof of Lemma 10)

Before proving Lemma 10, we prepare the following lemma and two theorems. In this section, we write $[n]\coloneqq\{1,\dots,n\}$ for $n\in\mathbb{N}$ . Recall that

[TABLE]

and

[TABLE]

Lemma 20.

Let $n\in\mathbb{N}$ and $\mathcal{Z}\subset\mathcal{Y}$ , and assume that

[TABLE]

for $y\in\mathcal{Z}$ . Then,

[TABLE]

Proof.

Using the identity $\mathbbm{1}{\left[{y=y_{i}}\right]}=\frac{1}{2}+\frac{2\,\mathbbm{1}{\left[{y=y_{i}}\right]}-1}{2}$ , we have

[TABLE]

where (56) and (57) follow from the subadditivity of the supremum, (58) follows from the fact that $\sigma_{i}(2\,\mathbbm{1}{\left[{y=y_{i}}\right]}-1)$ and $\sigma_{i}$ follow the same distribution, and the last inequality follows from the assumption. ∎
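Two elementary facts carry the proof of Lemma 20: the indicator identity, and the fact that multiplying each Rademacher variable by a fixed sign leaves the joint distribution unchanged. The following sketch verifies both by exhaustive enumeration for a small, arbitrarily chosen label vector; the labels and the helper names are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def sign_vectors(n):
    """All 2^n Rademacher sign vectors of length n."""
    return list(product((-1, 1), repeat=n))


def flipped(sigma, labels, y):
    """Multiply sigma_i by (2 * 1[y == y_i] - 1), which is either +1 or -1."""
    return tuple(s * (2 * int(y == y_i) - 1) for s, y_i in zip(sigma, labels))


labels = [1, 3, 2, 3, 1]  # hypothetical ordinal labels y_1, ..., y_n
y = 3                     # the fixed class in the indicator

# Indicator identity: 1[y = y_i] = 1/2 + (2 * 1[y = y_i] - 1) / 2.
identity_holds = all(
    int(y == y_i) == 0.5 + (2 * int(y == y_i) - 1) / 2 for y_i in labels
)

# The map sigma -> flipped(sigma) permutes the set of sign vectors, so the
# uniform distribution over them is preserved.
original = Counter(sign_vectors(len(labels)))
transformed = Counter(flipped(s, labels, y) for s in sign_vectors(len(labels)))
same_distribution = original == transformed
```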

Next, we give the upper bounds for the Rademacher complexities with $\mathcal{H}_{\mathrm{AT}}^{(y)}$, $\mathcal{H}_{\mathrm{IT}}^{(y)}$, and $\mathcal{H}_{\mathrm{LS}}^{(y)}$ without any assumptions on the class of decision functions $\mathcal{F}$. For convenience, we recall the theorem.

Theorem 7.

Fix $y\in\mathcal{Y}$. Let $n\in\mathbb{N}$ and assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$-Lipschitz. Then, the expected Rademacher complexities of $\mathcal{H}_{\mathrm{AT}}^{(y)},\mathcal{H}_{\mathrm{IT}}^{(y)}$, and $\mathcal{H}_{\mathrm{LS}}^{(y)}$ are bounded as

[TABLE]

where $\mathrm{sq}:\mathrm{Im}\,f\ni z\mapsto z^{2}\in\mathbb{R}$ .

Proof.

We prove the theorem for each loss.

AT loss. We have

[TABLE]

where (61) and (62) follow from the subadditivity of the supremum, (63) follows from the fact that $-\sigma_{i}$ and $\sigma_{i}$ follow the same distribution, and the last inequality follows from Talagrand’s lemma (Lemma 16 in Appendix C).

IT loss. The proof for the case of the IT loss follows a very similar argument to that of the AT loss, and we include the proof for completeness. We have

[TABLE]

where (65) and (66) follow from the subadditivity of the supremum, and the last inequality follows from the fact that $-\sigma_{i}$ and $\sigma_{i}$ follow the same distribution together with Talagrand’s lemma.

LS loss. We give the upper bound for the case of the LS loss. We have

[TABLE]

The first term in (68) is 0. The third term in (68) can be rewritten as

[TABLE]

where in (69) we used the fact that $-\sigma_{i}\,\mathrm{sign}(y+\theta_{1}-3/2)$ and $\sigma_{i}$ follow the same distribution. Summing up the above argument, the proof is completed. ∎
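The AT and IT cases both end with Talagrand’s lemma, which states that composing a function class with a $\rho$-Lipschitz function scales the Rademacher complexity by at most $\rho$. The sketch below checks this contraction exactly for a small finite class by enumerating all sign vectors. It is only an illustration: the class is a set of random output vectors, the Lipschitz function is the logistic loss (so $\rho=1$), and the function names are hypothetical.

```python
import math
import random
from itertools import product


def empirical_rademacher(outputs):
    """Exact empirical Rademacher complexity of a finite class.

    outputs[h][i] is the value of hypothesis h on sample point i."""
    n = len(outputs[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        total += max(sum(s * v for s, v in zip(sigma, row)) for row in outputs) / n
    return total / 2 ** n


def logistic(z):
    """Logistic loss log(1 + exp(-z)); its derivative lies in (-1, 0)."""
    return math.log1p(math.exp(-z))


rng = random.Random(0)
n, num_hypotheses = 8, 5
outputs = [[rng.uniform(-2.0, 2.0) for _ in range(n)] for _ in range(num_hypotheses)]
composed = [[logistic(v) for v in row] for row in outputs]

rho = 1.0  # Lipschitz constant of the logistic loss
lhs = empirical_rademacher(composed)       # complexity of the composed class
rhs = rho * empirical_rademacher(outputs)  # rho times complexity of the base class
```

The inequality `lhs <= rhs` is what allows the bounds above to be expressed through the complexity of the decision functions alone.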

Next, we give the upper bounds for the Rademacher complexities with $\mathcal{H}_{\mathrm{AT}}$, $\mathcal{H}_{\mathrm{IT}}$, and $\mathcal{H}_{\mathrm{LS}}$ without any assumptions on the class of decision functions $\mathcal{F}$, using Lemma 20 and Theorem 7. For convenience, we recall the theorem.

Theorem 8 (Upper bounds of Rademacher complexities with general decision functions class).

Let $n\in\mathbb{N}$ and assume for the AT and IT losses that the binary surrogate loss $\ell$ is $\rho$ -Lipschitz. Then, the expected Rademacher complexities of $\mathcal{H}_{\mathrm{AT}},\mathcal{H}_{\mathrm{IT}}$ , and $\mathcal{H}_{\mathrm{LS}}$ are bounded as

[TABLE]

where $\mathrm{sq}:\mathrm{Im}\,f\ni z\mapsto z^{2}\in\mathbb{R}$ .

Proof of Theorem 8.

For the AT and IT losses, the result of Theorem 8 directly follows from Lemma 20 and Theorem 7.

AT loss. For the case of the AT loss, we have

[TABLE]

where in the first inequality we used Lemma 20, and in the last inequality we used Theorem 7.

IT loss. In a similar manner, we can bound the case of the IT loss as

[TABLE]

LS loss. Next, we give the upper bound for the case of the LS loss, where we will not rely on Lemma 20 and Theorem 7. We have

[TABLE]

The first term in (74) is 0. The second term in (74) is bounded as

[TABLE]

since $\sigma_{i}$ and $-\sigma_{i}$ follow the same distribution. The fourth term in (74) can be rewritten as

[TABLE]

where in (76) we used the fact that $-\sigma_{i}\,\mathrm{sign}(\theta_{1}-3/2)$ and $\sigma_{i}$ follow the same distribution. Summing up the above argument, the proof is completed. ∎

Remark.

The upper bound for the case of the LS loss can also be obtained in the same way as for the AT and IT losses. However, doing so loosens the bound by a factor of $K$ compared to the analysis given above.

The next lemma is a well-known bound for the Rademacher complexity of linear-in-parameter models.

Lemma 21 (Theorem 5.10 in Mohri et al. (2018)).

Let $\mathcal{F}$ be the class of linear-in-parameter models, that is,

[TABLE]

for positive constants $C_{\bm{w}}$ and $C_{\bm{\phi}}$ . Then,

[TABLE]
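As a numerical illustration of this type of bound, the sketch below computes the empirical Rademacher complexity of a norm-bounded linear class exactly and compares it with $C_{\bm{w}}C_{\bm{\phi}}/\sqrt{n}$, the form given in Theorem 5.10 of Mohri et al. (2018). It assumes the constraints $\|\bm{w}\|_{2}\leq C_{\bm{w}}$ and $\|\bm{\phi}(\bm{x})\|_{2}\leq C_{\bm{\phi}}$, and it uses the closed form of the supremum over $\bm{w}$ obtained from the Cauchy–Schwarz inequality. The feature vectors are random and the function name is hypothetical.

```python
import math
import random
from itertools import product


def linear_rademacher(features, c_w):
    """Exact empirical Rademacher complexity of {x -> <w, phi(x)> : ||w||_2 <= c_w}.

    For fixed signs, the supremum over w equals c_w * ||sum_i sigma_i phi(x_i)||_2 / n."""
    n, d = len(features), len(features[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=n):
        s = [sum(sig * x[k] for sig, x in zip(sigma, features)) for k in range(d)]
        total += c_w * math.sqrt(sum(v * v for v in s)) / n
    return total / 2 ** n


rng = random.Random(0)
n, d = 10, 3
c_w, c_phi = 1.5, 2.0  # assumed norm bounds

# Random feature vectors rescaled so that ||phi(x_i)||_2 <= c_phi.
features = []
for _ in range(n):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    scale = c_phi * rng.random() / norm
    features.append([t * scale for t in v])

complexity = linear_rademacher(features, c_w)
bound = c_w * c_phi / math.sqrt(n)
```

The computed `complexity` stays below `bound`, and the $1/\sqrt{n}$ rate is what the estimation error bounds for linear-in-parameter models inherit.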

Finally, we prove Lemma 10 in the following.

Proof of Lemma 10.

The result of Lemma 10 directly follows from Theorems 7, 8 and Lemma 21 as follows.

AT and IT losses. Fix $j\in[K]$. Noting that

[TABLE]

and

[TABLE]

we can use Lemma 21 and have

[TABLE]

In a similar manner, using Theorem 8, we have

[TABLE]

and

[TABLE]

which completes the proof for the cases of the AT and IT losses.

LS loss. Since $|f(\bm{x})|\leq\|\bm{w}\|_{2}\|\bm{\phi}(\bm{x})\|_{2}\leq C_{\bm{w}}C_{\bm{\phi}}$ by the Cauchy–Schwarz inequality, the map $\mathrm{sq}:\mathrm{Im}\,f\rightarrow\mathbb{R}$ is $2C_{\bm{w}}C_{\bm{\phi}}$-Lipschitz. Therefore, using Theorem 8 and Talagrand’s lemma, we have

[TABLE]

where in the last inequality we used Lemma 21.

By arguments similar to the above, we can obtain the upper bounds for the cases with $\mathcal{H}_{\mathrm{AT}}$, $\mathcal{H}_{\mathrm{IT}}$, and $\mathcal{H}_{\mathrm{LS}}$. ∎
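For completeness, the Lipschitz constant of $\mathrm{sq}$ used in the LS case can be verified directly. The following short derivation is a sketch that only assumes $z,z^{\prime}\in\mathrm{Im}\,f\subset[-C_{\bm{w}}C_{\bm{\phi}},\,C_{\bm{w}}C_{\bm{\phi}}]$, which follows from the Cauchy–Schwarz bound on $|f(\bm{x})|$.

```latex
\begin{align*}
|\mathrm{sq}(z)-\mathrm{sq}(z')|
  &= |z^{2}-(z')^{2}|
   = |z+z'|\,|z-z'| \\
  &\leq \bigl(|z|+|z'|\bigr)\,|z-z'|
   \leq 2C_{\bm{w}}C_{\bm{\phi}}\,|z-z'| .
\end{align*}
```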

Appendix F Proof of Theorem 12

Proof.

Take any $g\in\mathcal{G}$. Recalling that $\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)=\gamma\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)+(1-\gamma)\widehat{\mathcal{S}}(g)$, we can rewrite the variance of the empirical semi-supervised risk $\operatorname*{Var}[\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)]$ as

[TABLE]

Hence, the condition $\operatorname*{Var}[\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)]<\operatorname*{Var}[\widehat{\mathcal{S}}(g)]$ is equivalent to

[TABLE]

This condition is indeed satisfied by taking $\gamma$ satisfying

[TABLE]

and the proof is completed. ∎
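The argument is an instance of the variance formula for a convex combination of two correlated estimators. The sketch below works through that formula with made-up numbers. It assumes the combination $\gamma A+(1-\gamma)B$ with $A$ playing the role of $\widehat{\mathcal{S}}_{\mathrm{LU}}^{\backslash k}(g)$ and $B$ that of $\widehat{\mathcal{S}}(g)$, and it derives the admissible range of $\gamma$ from the formula itself, which may be written differently in the theorem. All names and values are hypothetical.

```python
def mixed_variance(gamma, var_lu, var_l, cov):
    """Var[gamma * A + (1 - gamma) * B] with Var[A] = var_lu, Var[B] = var_l,
    and Cov[A, B] = cov."""
    return (gamma ** 2 * var_lu
            + (1 - gamma) ** 2 * var_l
            + 2 * gamma * (1 - gamma) * cov)


def gamma_upper_limit(var_lu, var_l, cov):
    """Exclusive upper limit on gamma > 0 for which the mixture has smaller
    variance than B alone. Assumes var_l > cov and Var[A - B] > 0.

    Derivation: mixed_variance - var_l
                = gamma * (gamma * Var[A - B] - 2 * (var_l - cov))."""
    return 2 * (var_l - cov) / (var_lu + var_l - 2 * cov)


# Made-up values satisfying cov^2 <= var_lu * var_l and var_l > cov.
var_lu, var_l, cov = 3.0, 1.0, 0.2
limit = gamma_upper_limit(var_lu, var_l, cov)

inside = [mixed_variance(f * limit, var_lu, var_l, cov) for f in (0.1, 0.5, 0.9)]
outside = mixed_variance(1.1 * limit, var_lu, var_l, cov)
```

Every value in `inside` is below `var_l`, while `outside` exceeds it, so in this toy setting the variance reduction holds exactly on the interval $(0,\texttt{limit})$.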

Appendix G Proof of Theorem 14

Proof.

If $\widehat{\mathcal{S}}_{\mathrm{L}}(g)$ is shown to be convex, the objective function $\widehat{J}_{\ell}(\bm{w},\bm{\theta})=\widehat{\mathcal{S}}_{\mathrm{SEMI}\text{-}\gamma}^{\backslash k}(g)+\Omega(\bm{\theta})$ is convex with respect to $\bm{w}$ and $\bm{\theta}$ since the other terms are convex.

AT loss. Recalling that the AT loss is defined as

[TABLE]

we have

[TABLE]

Then, the term inside the sum can be rewritten as

[TABLE]

Therefore, $\widehat{\mathcal{S}}_{\mathrm{L}}(g)$ is convex and the objective function $\widehat{J}_{\ell}(\bm{w},\bm{\theta})$ is convex.

LS loss. Recalling that the LS loss is defined as

[TABLE]

we have

[TABLE]

This is convex, and consequently the objective function $\widehat{J}_{\ell}(\bm{w},\bm{\theta})$ is convex, which completes the proof. ∎
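The convexity argument relies on each term being a convex binary loss applied to an affine function of $(\bm{w},\bm{\theta})$. The sketch below probes this numerically with a midpoint-convexity check. It is an illustration under assumptions: the all-threshold loss is written in the form of Pedregosa et al. (2017) with $\alpha_{j}=\theta_{j}-f(\bm{x})$, the binary loss is the logistic loss, $f$ is linear in $\bm{x}$, and the sign convention and function names are not taken from the paper.

```python
import math
import random


def logistic(z):
    """Convex binary surrogate loss log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))


def at_loss(w, theta, x, y):
    """All-threshold loss for one sample with label y in {1, ..., K}."""
    f = sum(wi * xi for wi, xi in zip(w, x))
    total = 0.0
    for j, th in enumerate(theta, start=1):
        alpha = th - f  # affine in (w, theta)
        total += logistic(-alpha) if j < y else logistic(alpha)
    return total


def midpoint_gap(p, q, x, y, d):
    """h((p + q) / 2) - (h(p) + h(q)) / 2 for h = at_loss as a function of
    the stacked parameters (w, theta); non-positive when h is convex."""
    def h(v):
        return at_loss(v[:d], v[d:], x, y)

    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return h(mid) - (h(p) + h(q)) / 2


rng = random.Random(0)
d, K = 3, 4  # feature dimension and number of classes (assumed)
worst_gap = -math.inf
for _ in range(500):
    p = [rng.uniform(-3.0, 3.0) for _ in range(d + K - 1)]
    q = [rng.uniform(-3.0, 3.0) for _ in range(d + K - 1)]
    x = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    y = rng.randint(1, K)
    worst_gap = max(worst_gap, midpoint_gap(p, q, x, y, d))
```

A non-positive `worst_gap` is consistent with convexity in $(\bm{w},\bm{\theta})$. It is a sanity check on random points rather than a proof.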

Appendix H Dataset Statistics and Additional Results of Experiments

H.1 Benchmark Dataset Statistics

The detailed dataset statistics used in the small and large scale experiments are given in Tables 7 and 8. The other statistics are given in Section 6.

H.2 Additional Results of Experiments

Here we show the extended experimental results for the variance reduction experiments discussed in Section 6. Figures 4 to 6 show the results of the variance comparison. We can see that the variances of the semi-supervised risks are smaller than that of the supervised risk. Figures 7 to 9 show the effect of the number of unlabeled data on the performance improvement. Note that the experiment on the toy dataset was not conducted because its number of unlabeled data is too small. We can see that the performance of the proposed method is indeed improved, but the improvement from adding a large amount of unlabeled data is limited except when using LS as the task surrogate loss.

Bibliography

1. Bao et al. (2018). Han Bao, Gang Niu, and Masashi Sugiyama. Classification from pairwise similarity and unlabeled data. In Proceedings of the 35th International Conference on Machine Learning, pages 452–461, 2018.
2. Bartlett and Mendelson (2002). Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
3. Belkin et al. (2006). Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399–2434, 2006.
4. Ben-David et al. (2003). Shai Ben-David, Nadav Eiron, and Philip M Long. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
5. Bender and Grouven (1997). Ralf Bender and Ulrich Grouven. Ordinal logistic regression in medical research. Journal of the Royal College of Physicians of London, 31(5):546–551, 1997.
6. Bender and Grouven (1998). Ralf Bender and Ulrich Grouven. Using binary logistic regression models for ordinal data with non-proportional odds. Journal of Clinical Epidemiology, 51(10):809–816, 1998.
7. Boyd and Vandenberghe (2004). Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
8. Chapelle et al. (2006). Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.
