Learning Context-Dependent Choice Functions

Karlson Pfannschmidt; Pritha Gupta; Björn Haddenhorst; Eyke Hüllermeier

arXiv:1901.10860 · cs.LG · October 25, 2021

TL;DR

This paper introduces models for learning context-dependent choice functions using neural networks, addressing challenges like variable input size and order invariance, with extensive empirical validation on synthetic and real data.

Contribution

It proposes a novel framework for modeling context-dependent preferences via utility functions and develops neural network architectures to learn these functions effectively.

Findings

01

Neural network models outperform baselines on synthetic datasets.

02

Models demonstrate strong generalization to real-world choice data.

03

Approaches handle variable input sizes and order invariance effectively.

Tables (5)

Table 1: Overview of the choice datasets used in the experiments. Bracket notation is used to denote the range of values.

Problem	Dataset	# Train	# Test	# Features	$|Q|$
Singleton Choice	Medoid	$10 000$	$100 000$	$5$	$10$
	Hypervolume	$10 000$	$100 000$	$2$	$10$
	MNIST-Mode	$10 000$	$100 000$	$128$	$10$
	MNIST-Unique	$10 000$	$100 000$	$128$	$10$
	Tag Genome Dissimilar Movie	$10 000$	$100 000$	$1128$	$10$
	Tag Genome Similar Movie	$10 000$	$100 000$	$1128$	$10$
	LETOR-MQ2007-list	$[1353, 1356]$	$[336, 339]$	$46$	$[257, 1346]$
	LETOR-MQ2008-list	$[627, 628]$	$[156, 157]$	$46$	$[204, 1831]$
	Expedia	$78 041$	$312 229$	$17$	$[5, 38]$
	Sushi	$7000$	$3000$	$7$	$10$
Subset Choice	Pareto-front-2D	$10 000$	$100 000$	$2$	$30$
	Pareto-front-5D	$10 000$	$100 000$	$5$	$30$
	MNIST-Mode	$10 000$	$100 000$	$128$	$10$
	MNIST-Unique	$10 000$	$100 000$	$128$	$10$
	LETOR-MQ2007	$[1160, 1172]$	$[283, 295]$	$46$	$[6, 147]$
	LETOR-MQ2008	$[442, 459]$	$[105, 122]$	$46$	$[5, 121]$
	Expedia	$79 855$	$319 489$	$17$	$[5, 38]$

Table 4: Dataset configurations for generalization experiments. Bracket notation is used to denote the range of values.

Problem	Dataset	# Features	# Train	# Test	Task set sizes $S$	Task set size $|Q|$
Singleton Choice	Medoid	$5$	$10 000$	$100 000$	$[3, 30]$	$10$
Singleton Choice	Hypervolume	$2$	$10 000$	$100 000$	$[3, 30]$	$10$

Table 8: Major Group feature description

Major Group
Value	Species	Value	Species	Value	Species
$0$	Aomono (blue-skinned fish)	$4$	Clam or shell	$8$	Other seafood
$1$	Akami (red meat fish)	$5$	Squid or octopus	$9$	Egg
$2$	Shiromi (white-meat fish)	$6$	Shrimp or crab	$10$	Meat other than fish
$3$	Tare (something like baste for eel)	$7$	Roe	$11$	Vegetables

Table 10: Mean and standard deviation of the accuracies on the singleton choice data (measured across 5 outer cross-validation folds). Best entry for each measure marked in bold.

Dataset	SCM	Accuracy	Top-3	Top-5
Medoid	FETA-Net	$0.846 \pm 0.010$	$0.994 \pm 0.001$	$1.000 \pm 0.000$
	FATE-Net	$0.881 \pm 0.007$	$0.996 \pm 0.001$	$1.000 \pm 0.000$
	FETA-Linear	$0.356 \pm 0.026$	$0.715 \pm 0.007$	$0.883 \pm 0.011$
	SDA	$0.839 \pm 0.004$	$0.987 \pm 0.001$	$0.998 \pm 0.000$
	RankNet	$0.531 \pm 0.008$	$0.873 \pm 0.006$	$0.970 \pm 0.004$
	PairwiseSVM	$0.021 \pm 0.001$	$0.194 \pm 0.009$	$0.501 \pm 0.002$
	MNL	$0.020 \pm 0.001$	$0.191 \pm 0.005$	$0.500 \pm 0.001$
	NL	$0.049 \pm 0.014$	$0.216 \pm 0.006$	$0.463 \pm 0.027$
	GNL	$0.020 \pm 0.000$	$0.195 \pm 0.004$	$0.500 \pm 0.001$
	ML	$0.003 \pm 0.000$	$0.055 \pm 0.012$	$0.249 \pm 0.032$
Hypervolume	FETA-Net	$0.769 \pm 0.022$	$0.933 \pm 0.007$	$0.980 \pm 0.001$
	FATE-Net	$0.730 \pm 0.018$	$0.920 \pm 0.013$	$0.968 \pm 0.006$
	FETA-Linear	$0.236 \pm 0.042$	$0.404 \pm 0.042$	$0.560 \pm 0.028$
	SDA	$0.233 \pm 0.019$	$0.417 \pm 0.029$	$0.589 \pm 0.036$
	RankNet	$0.203 \pm 0.004$	$0.369 \pm 0.006$	$0.562 \pm 0.004$
	PairwiseSVM	$0.186 \pm 0.001$	$0.340 \pm 0.002$	$0.550 \pm 0.002$
	MNL	$0.201 \pm 0.008$	$0.360 \pm 0.010$	$0.559 \pm 0.004$
	NL	$0.291 \pm 0.003$	$0.511 \pm 0.007$	$0.651 \pm 0.006$
	GNL	$0.293 \pm 0.018$	$0.471 \pm 0.021$	$0.663 \pm 0.014$
	ML	$0.189 \pm 0.014$	$0.451 \pm 0.019$	$0.621 \pm 0.014$
MNIST-Unique	FETA-Net	$0.972 \pm 0.002$	$0.995 \pm 0.001$	$0.998 \pm 0.000$
	FATE-Net	$0.954 \pm 0.009$	$0.993 \pm 0.001$	$0.998 \pm 0.001$
	FETA-Linear	$0.127 \pm 0.006$	$0.320 \pm 0.003$	$0.505 \pm 0.010$
	SDA	$0.858 \pm 0.029$	$0.935 \pm 0.026$	$0.955 \pm 0.018$
	RankNet	$0.134 \pm 0.008$	$0.307 \pm 0.002$	$0.495 \pm 0.002$
	PairwiseSVM	$0.124 \pm 0.010$	$0.319 \pm 0.008$	$0.502 \pm 0.007$
	MNL	$0.170 \pm 0.006$	$0.325 \pm 0.009$	$0.495 \pm 0.002$
	NL	$0.207 \pm 0.016$	$0.354 \pm 0.004$	$0.502 \pm 0.006$
	GNL	$0.651 \pm 0.006$	$0.763 \pm 0.003$	$0.841 \pm 0.001$
	ML	$0.490 \pm 0.003$	$0.718 \pm 0.005$	$0.784 \pm 0.002$
MNIST-Mode	FETA-Net	$0.908 \pm 0.004$	$0.961 \pm 0.003$	$0.978 \pm 0.004$
	FATE-Net	$0.669 \pm 0.005$	$0.907 \pm 0.004$	$0.943 \pm 0.003$
	FETA-Linear	$0.290 \pm 0.006$	$0.674 \pm 0.010$	$0.877 \pm 0.007$
	SDA	$0.513 \pm 0.047$	$0.806 \pm 0.041$	$0.901 \pm 0.061$
	RankNet	$0.284 \pm 0.002$	$0.668 \pm 0.003$	$0.876 \pm 0.003$
	PairwiseSVM	$0.289 \pm 0.007$	$0.675 \pm 0.011$	$0.881 \pm 0.007$
	MNL	$0.285 \pm 0.006$	$0.652 \pm 0.011$	$0.853 \pm 0.010$
	NL	$0.282 \pm 0.007$	$0.646 \pm 0.012$	$0.848 \pm 0.010$
	GNL	$0.274 \pm 0.003$	$0.641 \pm 0.008$	$0.849 \pm 0.006$
	ML	$0.216 \pm 0.010$	$0.536 \pm 0.020$	$0.765 \pm 0.022$
Tag Genome Similar Movie	FETA-Net	$0.184 \pm 0.001$	$0.481 \pm 0.002$	$0.699 \pm 0.002$
	FATE-Net	$0.185 \pm 0.003$	$0.482 \pm 0.006$	$0.699 \pm 0.004$
	FETA-Linear	$0.138 \pm 0.009$	$0.391 \pm 0.023$	$0.613 \pm 0.030$
	SDA	$0.099 \pm 0.022$	$0.306 \pm 0.050$	$0.511 \pm 0.058$
	RankNet	$0.174 \pm 0.003$	$0.477 \pm 0.002$	$0.708 \pm 0.003$
	PairwiseSVM	$0.145 \pm 0.011$	$0.405 \pm 0.019$	$0.626 \pm 0.018$
	MNL	$0.179 \pm 0.002$	$0.472 \pm 0.003$	$0.694 \pm 0.004$
	NL	$0.178 \pm 0.004$	$0.467 \pm 0.006$	$0.689 \pm 0.007$
	GNL	$0.179 \pm 0.002$	$0.472 \pm 0.003$	$0.694 \pm 0.003$
	ML	$0.117 \pm 0.001$	$0.353 \pm 0.009$	$0.575 \pm 0.013$

Table 11: Mean and standard deviation of the accuracies on the singleton choice data (measured across 5 outer cross-validation folds). Best entry for each measure marked in bold.

Dataset	SCM	Accuracy	Top-3	Top-5
Tag Genome Dissimilar Movie	FETA-Net	$0.512 \pm 0.004$	$0.835 \pm 0.004$	$0.942 \pm 0.002$
	FATE-Net	$0.510 \pm 0.001$	$0.830 \pm 0.002$	$0.938 \pm 0.002$
	FETA-Linear	$0.440 \pm 0.002$	$0.759 \pm 0.002$	$0.889 \pm 0.001$
	SDA	$0.451 \pm 0.047$	$0.694 \pm 0.072$	$0.789 \pm 0.054$
	RankNet	$0.435 \pm 0.002$	$0.779 \pm 0.001$	$0.914 \pm 0.001$
	PairwiseSVM	$0.369 \pm 0.016$	$0.712 \pm 0.012$	$0.871 \pm 0.008$
	MNL	$0.447 \pm 0.002$	$0.692 \pm 0.005$	$0.795 \pm 0.005$
	NL	$0.438 \pm 0.006$	$0.671 \pm 0.015$	$0.775 \pm 0.018$
	GNL	$0.443 \pm 0.004$	$0.681 \pm 0.010$	$0.784 \pm 0.011$
	ML	$0.417 \pm 0.003$	$0.763 \pm 0.001$	$0.895 \pm 0.005$
LETOR-MQ2007-list	FETA-Net	$0.334 \pm 0.007$	$0.577 \pm 0.012$	$0.705 \pm 0.006$
	FATE-Net	$0.288 \pm 0.002$	$0.508 \pm 0.006$	$0.639 \pm 0.004$
	FETA-Linear	$0.293 \pm 0.018$	$0.551 \pm 0.007$	$0.697 \pm 0.007$
	SDA	$0.047 \pm 0.013$	$0.137 \pm 0.007$	$0.211 \pm 0.014$
	RankNet	$0.287 \pm 0.033$	$0.513 \pm 0.050$	$0.627 \pm 0.037$
	PairwiseSVM	$0.302 \pm 0.008$	$0.541 \pm 0.031$	$0.654 \pm 0.039$
	MNL	$0.282 \pm 0.006$	$0.503 \pm 0.029$	$0.622 \pm 0.038$
	NL	$0.285 \pm 0.018$	$0.499 \pm 0.030$	$0.608 \pm 0.043$
	GNL	$0.287 \pm 0.020$	$0.509 \pm 0.029$	$0.625 \pm 0.037$
	ML	$0.282 \pm 0.005$	$0.503 \pm 0.038$	$0.628 \pm 0.037$
LETOR-MQ2008-list	FETA-Net	$0.266 \pm 0.015$	$0.396 \pm 0.019$	$0.504 \pm 0.017$
	FATE-Net	$0.281 \pm 0.012$	$0.369 \pm 0.015$	$0.544 \pm 0.012$
	FETA-Linear	$0.197 \pm 0.007$	$0.392 \pm 0.027$	$0.506 \pm 0.032$
	SDA	$0.028 \pm 0.007$	$0.078 \pm 0.032$	$0.124 \pm 0.034$
	RankNet	$0.225 \pm 0.026$	$0.399 \pm 0.020$	$0.501 \pm 0.023$
	PairwiseSVM	$0.203 \pm 0.014$	$0.376 \pm 0.032$	$0.497 \pm 0.021$
	MNL	$0.217 \pm 0.025$	$0.362 \pm 0.020$	$0.500 \pm 0.027$
	NL	$0.212 \pm 0.024$	$0.355 \pm 0.030$	$0.472 \pm 0.030$
	GNL	$0.222 \pm 0.020$	$0.366 \pm 0.034$	$0.494 \pm 0.026$
	ML	$0.213 \pm 0.015$	$0.367 \pm 0.019$	$0.501 \pm 0.025$
Expedia	FETA-Net	$0.215 \pm 0.006$	$0.451 \pm 0.016$	$0.587 \pm 0.008$
	FATE-Net	$0.203 \pm 0.006$	$0.434 \pm 0.003$	$0.576 \pm 0.003$
	FETA-Linear	$0.176 \pm 0.003$	$0.394 \pm 0.002$	$0.543 \pm 0.003$
	SDA	$0.115 \pm 0.008$	$0.288 \pm 0.014$	$0.431 \pm 0.015$
	RankNet	$0.210 \pm 0.001$	$0.445 \pm 0.001$	$0.590 \pm 0.001$
	PairwiseSVM	$0.179 \pm 0.000$	$0.405 \pm 0.001$	$0.550 \pm 0.000$
	MNL	$0.199 \pm 0.004$	$0.423 \pm 0.005$	$0.565 \pm 0.004$
	NL	$0.171 \pm 0.006$	$0.388 \pm 0.008$	$0.534 \pm 0.008$
	GNL	$0.168 \pm 0.006$	$0.385 \pm 0.010$	$0.531 \pm 0.009$
	ML	$0.181 \pm 0.010$	$0.406 \pm 0.010$	$0.551 \pm 0.007$
SUSHI	FETA-Net	$0.295 \pm 0.003$	$0.552 \pm 0.003$	$0.766 \pm 0.003$
	FATE-Net	$0.322 \pm 0.003$	$0.589 \pm 0.005$	$0.817 \pm 0.005$
	FETA-Linear	$0.273 \pm 0.006$	$0.500 \pm 0.014$	$0.680 \pm 0.012$
	SDA	$0.270 \pm 0.015$	$0.498 \pm 0.043$	$0.689 \pm 0.043$
	RankNet	$0.272 \pm 0.007$	$0.559 \pm 0.035$	$0.721 \pm 0.016$
	PairwiseSVM	$0.258 \pm 0.004$	$0.480 \pm 0.022$	$0.679 \pm 0.013$
	MNL	$0.271 \pm 0.004$	$0.502 \pm 0.003$	$0.677 \pm 0.010$
	NL	$0.253 \pm 0.006$	$0.533 \pm 0.019$	$0.730 \pm 0.025$
	GNL	$0.259 \pm 0.007$	$0.562 \pm 0.023$	$0.735 \pm 0.016$
	ML	$0.281 \pm 0.004$	$0.575 \pm 0.013$	$0.777 \pm 0.007$

Equations

p(Q, C) \coloneqq p(Q) \cdot p(C \mid Q)

\frac{p(C \mid Q)}{p(C^{\prime} \mid Q)} = \frac{p(C \mid Q^{\prime})}{p(C^{\prime} \mid Q^{\prime})}

U \colon \{(x, Q) : x \in Q \in \mathcal{Q}\} \longrightarrow \mathbb{R}, \quad (x, Q) \mapsto U(x, Q),

C_{\mathrm{singleton}}(U, Q) \coloneqq \operatorname*{arg\,max}_{x \in Q} U(x, Q)

C_{\mathrm{subset}}^{t}(U, Q) \coloneqq \{x \in Q : U(x, Q) \geq t\}.

p_{\mathrm{MNL}}(x \mid Q) \coloneqq \frac{\exp(U(x, Q))}{\sum_{x^{\prime} \in Q} \exp(U(x^{\prime}, Q))}.

p(C \mid Q) \coloneqq \gamma(U, Q) \prod_{x \in Q} \frac{\exp(\llbracket x \in C \rrbracket \, U(x, Q))}{1 + \exp(U(x, Q))}

\frac{p(C \mid Q)}{p(C^{\prime} \mid Q)} = \prod_{x \in Q} \frac{\exp(\llbracket x \in C \rrbracket \, U(x))}{\exp(\llbracket x \in C^{\prime} \rrbracket \, U(x))} = \prod_{x \in C \cup C^{\prime}} \frac{\exp(\llbracket x \in C \rrbracket \, U(x))}{\exp(\llbracket x \in C^{\prime} \rrbracket \, U(x))}

p(C \mid Q) \coloneqq \sum_{\pi} p(\pi) \, p(g(\pi) = C),

R(c) \coloneqq \int_{\mathcal{Q} \times \mathcal{C}} L(C, c(Q)) \, \mathrm{d}p(Q, C),

c^{*} \colon Q \mapsto \operatorname*{arg\,min}_{\hat{C} \in \mathcal{C}} \int_{\mathcal{C}} L(C, \hat{C}) \, \mathrm{d}p(C \mid Q).

R_{\mathrm{emp}}(c) \coloneqq \frac{1}{N} \sum_{i=1}^{N} L(C_{i}, c(Q_{i}))

U(x, (x_{1}, \dots, x_{k})) = U(x, (x_{\sigma(1)}, \dots, x_{\sigma(k)}))

U_{k} \colon D_{k} \longrightarrow \mathbb{R}, \quad D_{k} \coloneqq \{(x, A) : x \in \mathcal{X} \text{ and } A \subseteq \mathcal{X} \setminus \{x\} \text{ with } |A| = k\}

U(x, Q) \coloneqq U_{0}(x) + \sum_{k=1}^{K} \overline{U}_{k}(x, Q),

\overline{U}_{k}(x, Q) = \frac{1}{\binom{|Q|}{k} - \binom{|Q| - 1}{k - 1}} \sum_{Q^{\prime} \subseteq Q \setminus \{x\} : |Q^{\prime}| = k} U_{k}(x, Q^{\prime}).

U(x, Q) = U_{0}(x) + \frac{1}{|Q| - 1} \sum_{y \in Q \setminus \{x\}} U_{1}(x, \{y\}).

\forall x \in \mathcal{X}, \forall y \in \mathcal{X} \setminus \{x\} : \tilde{U}_{1}(x, \{y\}) = U_{1}(x, \{y\}) - \tilde{U}_{0}(x) + U_{0}(x).

U_{1}(x, \{y\}) - U_{1}(x, \{z\}) = \tilde{U}_{1}(x, \{y\}) - \tilde{U}_{1}(x, \{z\}).

U_{\mathrm{FETA}}^{(U_{0}, U_{1})}(x, Q) - U_{\mathrm{FETA}}^{(U_{0}, U_{1})}(x, Q^{\prime}) = \frac{1}{|Q| - 1} (U_{1}(x, \{y\}) - U_{1}(x, \{z\})).

b(x) \coloneqq \begin{cases} \tilde{U}_{0}(x) - U_{0}(x), & \text{if } x = x_{0}, \\ U_{1}(x, \{x_{0}\}) - \tilde{U}_{1}(x, \{x_{0}\}), & \text{if } x \neq x_{0}. \end{cases}

\tilde{U}_{1}(x, \{y\}) = U_{1}(x, \{y\}) - (U_{1}(x, \{x_{0}\}) - \tilde{U}_{1}(x, \{x_{0}\})) = U_{1}(x, \{y\}) - b(x).

\forall x \in \mathcal{X} \setminus \{x_{0}\} : \forall y \in \mathcal{X} \setminus \{x\} : \tilde{U}_{1}(x, \{y\}) = U_{1}(x, \{y\}) - b(x).

\tilde{U}_{0}(x) - U_{0}(x) = \frac{1}{|Q| - 1} \sum_{y \in Q \setminus \{x\}} (U_{1}(x, \{y\}) - \tilde{U}_{1}(x, \{y\})) = \frac{1}{|Q| - 1} \sum_{y \in Q \setminus \{x\}} b(x) = b(x).

\forall x \in \mathcal{X} : b(x) = \tilde{U}_{0}(x) - U_{0}(x).

\forall y \in \mathcal{X} \setminus \{x_{0}\} : \tilde{U}_{1}(x_{0}, \{y\}) = U_{1}(x_{0}, \{y\}) - b(x_{0}).

c(\{a, b, c, x\})

c(\{a, b, c, x^{\prime}\})

u_{a} + \frac{1}{3} (u_{a, b} + u_{a, c} + u_{a, x})

Code & Models

Repositories

kiudee/cs-ranking (tf, Official)

Taxonomy

Topics: Economic and Environmental Valuation · Bayesian Modeling and Causal Inference · Multi-Criteria Decision Making

Full text

Learning Context-Dependent Choice Functions

Karlson Pfannschmidt [ID](https://orcid.org/0000-0001-9407-7903)

Pritha Gupta [ID](https://orcid.org/0000-0002-7277-4633)

Björn Haddenhorst [ID](https://orcid.org/0000-0002-4023-6646)

Eyke Hüllermeier [ID](https://orcid.org/0000-0002-9944-4108)

Abstract

Choice functions accept a set of alternatives as input and produce a preferred subset of these alternatives as output. We study the problem of learning such functions under conditions of context-dependence of preferences, which means that the preference in favor of a certain choice alternative may depend on what other options are also available. In spite of its practical relevance, this kind of context-dependence has received little attention in preference learning so far. We propose a suitable model based on context-dependent (latent) utility functions, thereby reducing the problem to the task of learning such utility functions. Practically, this comes with a number of challenges. For example, the set of alternatives provided as input to a choice function can be of any size, and the output of the function should not depend on the order in which the alternatives are presented. To meet these requirements, we propose two general approaches based on two representations of context-dependent utility functions, as well as instantiations in the form of appropriate end-to-end trainable neural network architectures. Moreover, to demonstrate the performance of both networks, we present extensive empirical evaluations on both synthetic and real-world datasets.

Keywords: preference learning $\cdot$ choice functions $\cdot$ context-dependence $\cdot$ neural networks

1 Introduction

The notion of preference plays a central role in various scientific disciplines, such as economics, psychology, and more recently also computer science and artificial intelligence [19]. In these fields, mathematical formalisms have been developed for modelling and reasoning about preferences, and for analyzing data that originates from observed or revealed preferences. In this regard, choice observations are of specific interest, in which a subset of “good” alternatives is selected from a set of available candidates. In particular, starting with the seminal work by [6], choice functions have been analyzed as a key concept of a formal theory of choice and preference. The study of pairwise preferences even goes back to work by [36], who considered the varying perception of different stimuli.

In machine learning, preferences are at the core of preference learning, which has received increasing attention in recent years [40]. Roughly speaking, the goal in preference learning is to learn (predictive) preference models from preference data. Somewhat surprisingly, and in spite of a close connection between ranking and choice, the problem of learning subset choice functions has received very little attention so far, with only a few notable exceptions [10, 109]. In this paper, we therefore address the problem of learning choice functions, which express preferences in terms of subsets (or equivalently, bipartitions) of a given set $Q$ of alternatives. From a machine learning point of view, this problem comes with a number of challenges. For example, while algorithms for supervised learning normally assume inputs in the form of feature vectors of fixed length, the inputs in our setting are neither vectors nor of fixed size. Instead, a choice function is supposed to accept inputs in the form of sets $Q$ of any size, and to return a subset (choice) of the elements as output. If a set $Q$ is represented by an ordered list of its elements, a choice function thus has to be invariant with respect to permutations of its input.

No less interesting, and in fact the key motivation of this paper, is that choice functions can be context-dependent, in the sense that the preference in favor of an alternative may depend on what other options are available. Context-dependence of this kind has been observed, for example, in marketing studies [26, 13], and has been investigated systematically in fields like economics and psychology. More specifically, three major context effects have been identified in the literature, namely the compromise effect [103], the attraction effect [52], and the similarity effect [113]:

• The compromise effect states that the relative utility of an object increases when an extreme option is added that makes it a compromise in the set of alternatives [94]. For instance, consider the set of objects $\{A,B\}$ in Figure 1(a). The ordering of these objects depends on how the consumer weighs the quality of the product against its price. If price is the main constraint, then the preference order will be $A\succ B$. But as soon as another extreme option $C$ becomes available, object $B$ may be considered more favorable, because it represents a compromise among the three alternatives. Thus, the preference relation between $A$ and $B$ might get inverted and turned into $B\succ A$.

• Figure 1(b) illustrates the attraction effect. Here, if we add another object $C$ to the set of objects $\{A,B\}$, where $C$ is slightly dominated by $B$, the relative utility share for object $B$ increases with respect to $A$. The major psychological reason is that consumers have a strong preference for dominating products [52]. Thus, the preference relation between $A$ and $B$ may again be influenced.

• The similarity or substitution effect is another phenomenon, according to which the presence of similar objects tends to reduce the overall probability of an object being chosen, as it divides the loyalty of potential consumers [52]. In Figure 1(c), $B$ and $C$ are two similar objects. Consumers who prefer high quality will be divided amongst the two objects, resulting in a decrease of the relative utility share of object $B$. Again, this may lead to turning a preference $B\succ A$ into $A\succ B$, at least on an aggregate (population) level, if preferences are defined on the basis of choice probabilities.

Context-dependence as explained above has received only limited consideration in the machine learning literature until recently [22, 85, 10, 101, 95, 15, 63].

Additionally, the context effects discussed so far are those that have been observed for humans; the space of (subset) choice functions, and thus the number of possible applications, is much larger. Many algorithmic problems can be framed as choice problems, e. g., in the knapsack problem one is tasked with choosing a set of maximal utility while obeying capacity constraints. Computing the medoid of a set of points (i. e., the point with minimal average distance to all other points) is a singleton choice problem. It is clear that these problems cannot be solved by considering each choice alternative individually; instead, the complete choice context needs to be incorporated. In practice, there are many abstract choice problems similar to these, e. g., portfolio selection [72], algorithm selection [91, 14] and team selection [119], just to name a few. All these problems have in common that context-dependence arises naturally because the output depends jointly on all objects in the set, not because a decision maker behaves rationally or irrationally.
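To make the medoid example concrete, the following minimal sketch treats medoid selection as a singleton choice. It assumes Euclidean distance and takes the medoid to be the point with the smallest summed distance to the other points; the function name `medoid` and the one-dimensional toy points are hypothetical and only serve as an illustration.

```python
import math

def medoid(points):
    """Return the point with the smallest summed distance to all points in the task."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)

# Hypothetical 1-D task: (1.0,) is chosen over (5.0,).
task = [(0.0,), (1.0,), (5.0,)]
print(medoid(task))

# Adding two further points reverses the outcome: now (5.0,) is chosen over (1.0,).
larger_task = task + [(6.0,), (7.0,)]
print(medoid(larger_task))
```

Both tasks contain the points (1.0,) and (5.0,), yet which of the two is selected depends on the remaining points, which is exactly the kind of context-dependence that a score computed for each point in isolation cannot express.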

Motivated by its practical relevance, we formalize the problem of learning context-dependent choice functions. To this end, we provide a formal definition of such functions and propose a data-generating process consisting of two stages: First, choice alternatives are scored in terms of latent utility degrees, and then, a choice set is determined on the basis of these scores (Section 3). Based on this model, we propose two representations of the latent (context-dependent) utility, called First Evaluate Then Aggregate (FETA) and First Aggregate Then Evaluate (FATE), which have appealing properties from a learning point of view (Section 4), as well as realizations of these models in terms of neural network architectures (Section 5). Thanks to these architectures, called FETA-Net and FATE-Net [85], we are able to learn subset choices on sets of objects in an end-to-end trainable manner. To demonstrate the performance of both networks, we present extensive empirical evaluations on both synthetic and real-world choice datasets (Section 6). Additional information and supplementary material are provided in an appendix, to which we will refer occasionally.

2 Related Literature

The problem of how to model preferences in general has been extensively studied from different viewpoints in the past. From an axiomatic/normative perspective, one posits which properties have to hold for preferences to be considered “rational,” and studies consequences of these properties. Luce’s choice axiom was introduced in 1959 by [67] and requires that the preference between two items does not depend on the presence or absence of any other choice alternative, a property commonly referred to as independence of irrelevant alternatives (IIA). The set of objects from which a particular preference is observed is also called the context [77, 20, 94], and thus preferences obeying IIA are also called context-independent [60]. In the same year, [27, pp. 56f] proved the ordinal representation theorem, which shows that preferences can be represented by a continuous utility function, if certain conditions including transitivity are assumed to hold. A related line of research was concerned with the concept of revealed preferences, for which most axioms can be reduced to some notion of transitivity [98, 50, 100].

On the other side of the spectrum, observational studies in economics and psychology were more concerned with how humans actually behave, and studied how the observed behavior deviates from IIA [114, 113, 51, 52, 103, 82, 104, 102, 115, 30, 83, 99]. It was consistently observed that choice behavior depends on the specific collection of alternatives available, i. e., the context of the choice. [94] and [92] provide an extensive overview of the different context effects which were identified over the years and which we already showcased in the introduction. This motivated researchers to come up with methods able to model these violations. Classical random utility models (RUMs), like the multinomial logit (MNL) model, are not able to take these effects into account. Therefore, extensions of RUMs were proposed, which are able to capture the compromise and attraction effect [115, 61, 80], the similarity effect [113, 56] or all of the above [94]. One important line of research focuses on the assumption that the decision maker chooses based on multiple utility functions (so-called “multiple selves”, or “multi-self” for short), which are suitably aggregated. This setting has been studied in economics [73, 55, 61, 39, 71, 45] and psychology [113, 102, 115]. Continuing this line of research, [5] show that by utilizing a collection of context-independent utility functions, combined with a suitable aggregation, one is able to model arbitrary choice functions. That is, choice behavior across multiple sets can be modelled even though it might violate context-independence.

While traditional research on preferences, as discussed above, is mostly of a normative, prescriptive or descriptive nature, the advent of machine learning triggered a shift towards “predictive” models. [95] build on ideas of the multi-self literature and propose to learn set-dependent weights and embeddings, which are then linearly combined to arrive at an aggregated score for each object. [10] consider the problem of learning preferences in the form of subsets of objects. To this end, they extend the classical multinomial logit model to account for violations of context-independence. Higher-order interactions between objects are added specifically for those subsets that cause a violation. The set of objects for which choices or choice sets are observed is assumed to be fixed. Therefore, the approach cannot be used for arbitrary task sets, where it can happen that an object is only observed once. Decomposing a context-dependent utility function into an aggregation across smaller sub-contexts is a recent, promising direction in studying choices [85, 101], and this approach will be the focus of this paper.

Decomposition approaches have also been employed in the related field of “learning to rank”. [2] employ a context-independent model to pre-sort the objects, while a recurrent neural network is used in a subsequent step to fine-tune the ranking. The FATE approach, introduced in the context of choice by [85], obviates the need to pre-sort the objects, by directly embedding each object to produce a representation for each set of objects (aggregation), which is then used as the context to produce the final ranking (evaluation). The authors also introduce an algorithm where this order is swapped, called FETA, in which each object is scored in the context of another object first, and only then the scores are aggregated to produce a final ranking. [3] later consider a similar decomposition, where higher order interactions are approximated by employing sampling.
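The two orderings described above can be illustrated with a small sketch. It is not the architecture of FETA-Net or FATE-Net: the FETA variant uses only pairwise (first-order) sub-contexts averaged over the other objects, the FATE variant assumes that the set representation is the mean of the object embeddings, and the callables `u0`, `u1`, `embed` and `score` are hypothetical stand-ins for learned components.

```python
import math

def feta_utility(x, task, u0, u1):
    """First evaluate x against each other object, then aggregate (first-order FETA)."""
    others = [y for y in task if y != x]
    if not others:
        return u0(x)
    pairwise = sum(u1(x, y) for y in others) / len(others)
    return u0(x) + pairwise

def fate_utility(x, task, embed, score):
    """First aggregate all objects into a context vector, then evaluate x against it."""
    embedded = [embed(y) for y in task]
    dim = len(embedded[0])
    # Assumption: the aggregation is a plain average of the embeddings.
    context = tuple(sum(e[i] for e in embedded) / len(embedded) for i in range(dim))
    return score(x, context)

# Hypothetical stand-ins for the learned components.
u0 = lambda x: sum(x)
u1 = lambda x, y: 1.0 if x[0] > y[0] else 0.0
embed = lambda y: y
score = lambda x, context: -math.dist(x, context)

task = [(0.0, 1.0), (2.0, 0.0), (4.0, 3.0)]
x = (2.0, 0.0)
print(feta_utility(x, task, u0, u1))  # 2.5
print(fate_utility(x, task, embed, score))
```

Because both functions aggregate with an average, they accept tasks of any size and return the same score for any ordering of the objects in `task`.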

3 A Probabilistic Model of Choice

We start by establishing the necessary notation (refer to Appendix A for an overview). Throughout this paper, $\llbracket A\rrbracket$ is defined to be $1$ if $A$ is a true statement, and $0$ otherwise. We will denote by $\mathcal{X}\subset\mathbb{R}^{d}$ a set of reference objects serving as choice alternatives, which, for simplicity, we assume to be finite (albeit of arbitrary size), if not explicitly stated otherwise. An object or item $\boldsymbol{x}\in\mathcal{X}$ is represented by a vector of features $\boldsymbol{x}=(x_{1},\dots,x_{d})\in\mathcal{X}$. A non-empty subset $\mathcal{Q}$ of $2^{\mathcal{X}}\setminus\{\emptyset\}$ is called a choice task space if $\emptyset\not\in\mathcal{Q}\not=\emptyset$, and any $Q\in\mathcal{Q}$ is called a choice task. A choice for $Q\in\mathcal{Q}$ is a non-empty subset of $Q$, and the set $\mathcal{C}\coloneqq\bigcup_{Q\in\mathcal{Q}}2^{Q}\setminus\{\emptyset\}$ of choices for any $Q\in\mathcal{Q}$ is called the choice space.

We say that a function $c\colon\mathcal{Q}\longrightarrow\mathcal{C}$ is a (subset) choice function (for $\mathcal{Q}$) if $c(Q)\subseteq Q$ is fulfilled for any $Q\in\mathcal{Q}$, and in case $|c(Q)|=1$ holds for any $Q\in\mathcal{Q}$, $c$ is called a singleton choice function (for $\mathcal{Q}$). A typical real-world example of a singleton choice function arises when a user enters a query in a search engine and receives a list of results ($Q$), of which they pick one and click on it. Subset choice functions usually occur when a diverse set of objects is sought, e. g., when a search engine decides on a set of the most relevant, but diverse, results to display to the user.

As common in machine learning, the input-output dependency of interest, in our case between tasks and choices, is not assumed to be deterministic. Instead, we assume a probabilistic dependence, which is captured by a (conditional) probability distribution $\operatorname{\mathit{p}}(\cdot\mid Q)$ on the non-empty subsets of $Q$ for every $Q\in\mathcal{Q}$ . Here, $p(C\mid Q)$ is interpreted as the probability to observe the choice $C$ given the task $Q$ . For the sake of convenience, we suppose w.l.o.g. $\operatorname{\mathit{p}}(\cdot\mid Q)$ to be extended to $\mathcal{C}$ via $\operatorname{\mathit{p}}(C\mid Q)\coloneqq 0$ for any $C\in\mathcal{C}\setminus 2^{Q}$ . Moreover, we write for short $\operatorname{\mathit{p}}(\boldsymbol{x}\mid Q)$ for $p(\{\boldsymbol{x}\}\mid Q)$ . In case $\operatorname{\mathit{p}}(Q)$ is the latent probability that $Q\in\mathcal{Q}$ is given as task, the whole data-generating process is modelled by the joint distribution

$$p(Q,C)\coloneqq p(Q)\cdot p(C\mid Q)$$

on $\mathcal{Q}\times\mathcal{C}$.

We call the choice probabilities context-independent if

$$\frac{p(C\mid Q)}{p(C^{\prime}\mid Q)}=\frac{p(C\mid Q^{\prime})}{p(C^{\prime}\mid Q^{\prime})}$$

is fulfilled for every $Q,Q^{\prime}\in\mathcal{Q}$ and any $C,C^{\prime}\in\mathcal{C}$ with $C,C^{\prime}\subseteq Q\cap Q^{\prime}$. Conversely, we say that a system of choice distributions is context-dependent if this equality is violated for at least one pair $Q,Q^{\prime}\in\mathcal{Q}$. This definition extends, in a straightforward and consistent way, the notion of independence of irrelevant alternatives (IIA) introduced by [6], which was originally only defined for the case of singleton choice, in which $\mathcal{C}$ consists of elements of size one only. We choose to use the more general term of context-(in)dependence, for the simple reason that the notion of “irrelevant” alternatives is rather tailored to the analysis of human choices but less meaningful in our more general setting of arbitrary choice functions.
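As an illustration of this definition for singleton choices, the sketch below compares the odds $p(a\mid Q)/p(b\mid Q)$ across two tasks under a multinomial logit model, $p_{\mathrm{MNL}}(x\mid Q)\propto\exp(U(x,Q))$. The object names, the utility values and the size of the context effect are hypothetical and chosen only for the example.

```python
import math

def mnl_probs(task, utility):
    """Singleton choice probabilities p(x | Q) under an MNL model with utility U(x, Q)."""
    weights = {x: math.exp(utility(x, task)) for x in task}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

def odds(a, b, task, utility):
    """Ratio p(a | Q) / p(b | Q) for two alternatives contained in the task."""
    probs = mnl_probs(task, utility)
    return probs[a] / probs[b]

base = {"a": 1.0, "b": 0.5, "c": 0.0}  # hypothetical context-independent utilities

def independent(x, task):
    return base[x]

def dependent(x, task):
    # Hypothetical context effect: "b" gains utility whenever "c" is available.
    return base[x] + (0.8 if x == "b" and "c" in task else 0.0)

Q, Q2 = ("a", "b"), ("a", "b", "c")
print(odds("a", "b", Q, independent), odds("a", "b", Q2, independent))  # equal up to rounding
print(odds("a", "b", Q, dependent), odds("a", "b", Q2, dependent))      # above 1, then below 1
```

With the context-independent utility the odds agree across both tasks, so the equality above holds; with the context-dependent utility they differ, and the preference between `a` and `b` is even reversed once `c` is added.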

As an example, consider the knapsack problem, where the goal is to select a set of objects that maximizes a certain utility while obeying capacity constraints. The decision on which objects to include in the choice set needs to incorporate the complete choice task context, and one cannot ascertain the relative choice probability of two alternatives while ignoring all others. As already explained in the introduction, context-independence is often violated in practice. This motivates the development of context-dependent learning methods.
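The knapsack example can be made concrete with a small brute-force sketch. The item names, weights, values and the capacity below are hypothetical and chosen only for illustration: object `a` is selected over `b` in the first task, but adding a third object `c` reverses that preference.

```python
from itertools import combinations

def knapsack_choice(items, capacity):
    """Return the subset of item names with maximal total value within capacity.

    items: dict mapping name -> (weight, value). Brute force; small tasks only.
    """
    best, best_value = frozenset(), 0
    names = sorted(items)
    for r in range(1, len(names) + 1):
        for subset in combinations(names, r):
            weight = sum(items[n][0] for n in subset)
            value = sum(items[n][1] for n in subset)
            if weight <= capacity and value > best_value:
                best, best_value = frozenset(subset), value
    return best

# Hypothetical items: name -> (weight, value).
task_1 = {"a": (3, 4), "b": (2, 3)}
task_2 = {"a": (3, 4), "b": (2, 3), "c": (2, 3)}

choice_1 = knapsack_choice(task_1, capacity=4)  # a and b do not fit together
choice_2 = knapsack_choice(task_2, capacity=4)  # b and c fit together
```

Whether `a` or `b` is chosen thus cannot be decided from the two objects alone, which is the kind of dependence a context-independent model cannot capture.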

Utility-Based Choices

We propose to model choices as the result of a two-stage process (cf. Figure 2 for an overview), grounding them on the notion of utility: In the first stage, each object in a given task $Q\in\mathcal{Q}$ is assigned a real-valued utility score. Then in the second stage, choices are generated based on these scores.

Utility theory has a long history in economics [120, 25, 73]. Originally introduced as a way to measure the satisfaction achieved by a certain alternative [11], it is nowadays common in decision theory to consider utility more as an abstract value that ought to be maximized by any rational decision maker [120, 96]. This is formalized by means of a generalized utility function (for $\mathcal{Q}$ )

[TABLE]

which allows for modelling the utility of an object as a function of both the properties of the object itself and the properties of the other choice alternatives in $Q$ , which constitute the context in which $\boldsymbol{x}$ is considered: $U(\boldsymbol{x},Q)$ expresses a degree of utility of $\boldsymbol{x}$ in the context $Q$ , i. e., given the availability of other choice alternatives $\boldsymbol{x^{\prime}}\in Q\setminus\{\boldsymbol{x}\}$ . The score $U(\boldsymbol{x},Q)$ is supposed to capture an abstract notion of utility, which in turn reflects the propensity of $\boldsymbol{x}$ to be chosen in any task $Q$ .

We call a utility function context-independent in case $U(\boldsymbol{x},Q)=U(\boldsymbol{x},Q^{\prime})$ holds for any $Q,Q^{\prime}\in\mathcal{Q}$ with $\boldsymbol{x}\in Q\cap Q^{\prime}$ and context-dependent otherwise. By abbreviating $U(\boldsymbol{x})\coloneqq U(\boldsymbol{x},Q)$ for some arbitrary $Q\in\mathcal{Q}$ with $\boldsymbol{x}\in Q$ , any context-independent utility function may be thought of as a function $U:\mathcal{X}\longrightarrow\mathbb{R}$ .

Moving on to the second stage, based on a utility function $U$ , one may define in a deterministic manner for $Q\in\mathcal{Q}$ the corresponding singleton choice as

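$$C_{\mathrm{singleton}}(U,Q)\coloneqq\operatorname*{arg\,max}_{\boldsymbol{x}\in Q}U(\boldsymbol{x},Q)\tag{3}$$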

and for $t\in\mathbb{R}$ the subset choice (with threshold $t$ ) as

[TABLE]

Clearly, $C_{\mathrm{singleton}}(U,\cdot)$ and $C_{\mathrm{subset}}^{t}(U,\cdot)$ are in fact choice functions, and if $\boldsymbol{x}\mapsto U(\boldsymbol{x},Q)$ is injective (i. e., there are no ties) for every $Q\in\mathcal{Q}$ , the former is a singleton choice function. There is an interesting connection to social choice theory, where a social choice rule is employed to select an outcome out of a set of possible outcomes in order to maximize some notion of utility for a population of individuals with possibly varying utility functions. The injectivity of such a social choice rule is called resoluteness and it is an important property considered in social choice theory, where it also plays a role in several impossibility results [59, 81]. The singleton choice is a special case of the more general top- $k$ choice, where the goal is to select the $k$ best objects. It differs from subset choice insofar as the size of the choice sets is always fixed, whereas in subset choice it can vary. The top- $k$ choice setting has strong connections to the ranking setting, which we will discuss below.

Further note that using thresholding to convert a set of scores into a partition is a standard approach in multi-label classification [64] and multi-criteria sorting [4].
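The two deterministic choice rules can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the function names are hypothetical, ties in the singleton rule are returned as a set, the threshold is assumed to be a strict inequality, and the example utility (closeness to the task mean) is invented to show a context-dependent $U$.

```python
def singleton_choice(utility, task):
    """Objects of maximal utility in the task (one object if there are no ties)."""
    scores = {x: utility(x, task) for x in task}
    top = max(scores.values())
    return frozenset(x for x, s in scores.items() if s == top)

def subset_choice(utility, task, t):
    """Objects whose utility exceeds the threshold t (strictness is an assumption)."""
    return frozenset(x for x in task if utility(x, task) > t)

# Hypothetical context-dependent utility: closeness to the mean of the task.
def closeness_to_mean(x, task):
    mean = sum(task) / len(task)
    return -abs(x - mean)

task = frozenset({1.0, 2.0, 6.0})
single = singleton_choice(closeness_to_mean, task)
subset = subset_choice(closeness_to_mean, task, t=-2.5)
```

With the task mean at 3, the object 2.0 has the highest utility, and the threshold $-2.5$ keeps the two objects within distance 2.5 of the mean.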

In the probabilistic setting, the utility function $U$ may serve to model probabilistic choices $p(\cdot\mid Q)$ , $Q\in\mathcal{Q}$ , on $\mathcal{C}$ by using the utility scores as the corresponding parameters of the distributions. Certainly, there are various ways in which this idea could be realized:

Singleton choice

In the case of singleton choice, a natural assumption is the multinomial logit (MNL) model, in which for any $Q\in\mathcal{Q}$ and $\boldsymbol{x}\in Q$ ,

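$$\operatorname{\mathit{p}}(\boldsymbol{x}\mid Q)=\frac{\exp\bigl(U(\boldsymbol{x},Q)\bigr)}{\sum_{\boldsymbol{y}\in Q}\exp\bigl(U(\boldsymbol{y},Q)\bigr)}\tag{5}$$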

and $p(C\mid Q)=0$ for any $C\in 2^{Q}$ of size $\geq 2$ [12, 46, 24, 70, 108]. Note that these choice probabilities are context-independent if $U$ is context-independent. An important special case is the Bradley-Terry-Luce model [16], which only considers pairwise comparisons (i. e., $|Q|=2$ for all $Q\in\mathcal{Q}$ ).
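As a minimal sketch of the MNL model, the following code computes the softmax of the utilities within a task. The utility values are hypothetical. With a context-independent $U$, the ratio of the choice probabilities of two objects is the same in every task containing both, which is the ratio form in which IIA is usually stated.

```python
import math

def mnl_probabilities(utility, task):
    """Softmax of the utilities U(x, Q) over the objects x of the task Q."""
    scores = {x: utility(x, task) for x in task}
    shift = max(scores.values())  # subtract the maximum for numerical stability
    weights = {x: math.exp(s - shift) for x, s in scores.items()}
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}

# Hypothetical context-independent utility: ignores the task.
def fixed_utility(x, task):
    return {"a": 1.0, "b": 0.0, "c": -1.0}[x]

p_small = mnl_probabilities(fixed_utility, {"a", "b"})
p_large = mnl_probabilities(fixed_utility, {"a", "b", "c"})
ratio_small = p_small["a"] / p_small["b"]
ratio_large = p_large["a"] / p_large["b"]
```

Both ratios equal $\exp(U(a)-U(b))$; a utility that depends on the task would in general break this equality.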

Subset choice

For the choice of arbitrary subsets (not limited to singleton sets), a simple model is obtained by treating the inclusion or exclusion of each object $\boldsymbol{x}$ in a task $Q$ as independent given the utilities. This results in the distributions $\operatorname{\mathit{p}}(\cdot\mid Q)$ given by

[TABLE]

for any non-empty $C\in 2^{Q}$ and $Q\in\mathcal{Q}$ , where $\gamma(U,Q)$ is a constant such that $\sum_{C\in 2^{Q}\setminus\{\emptyset\}}\operatorname{\mathit{p}}(C\mid Q)=1$ holds. If $U$ is context-independent, the quantity

[TABLE]

does not depend on $Q$ , and thus the choice probabilities $\operatorname{\mathit{p}}(C\mid Q)$ are context-independent as well.

Choices based on rankings

Yet another type of model is obtained by assuming that, based on the latent utilities $U(\boldsymbol{x},Q)$ , $\boldsymbol{x}\in Q$ , a ranking $\pi$ on $Q$ is sampled first and then turned into a choice set via a (possibly probabilistic) procedure $g\colon\pi\mapsto g(\pi)\in 2^{Q}$ afterwards. The probability $\operatorname{\mathit{p}}(C\mid Q)$ is then simply the probability that this procedure results in the output $C$ , i. e.,

[TABLE]

where the sum is taken over all possible rankings $\pi$ over $Q$ . An approach of that kind might be appealing, because probability distributions on rankings have been studied quite thoroughly in the literature. Important families of ranking distributions include distance-based ranking models [37], of which the Mallows model [69] is a popular instance, and multistage ranking models [38], most prominently represented by the Plackett-Luce distribution [86]. An important special case for $g$ is top- $k$ choice, where the first $k$ objects are chosen deterministically (i. e., $g(\pi)=\{\pi^{-1}(1),\dots,\pi^{-1}(k)\}\subseteq Q$ holds with probability $1$ for any ranking $\pi\colon Q\longrightarrow\mathbb{N}$ ). This can be generalized, for example, by assuming that the size $k$ is not fixed but random. An even more general model has recently been proposed by [34], where choices are not necessarily restricted to top- $k$ sets.
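A ranking-based choice can be simulated directly. The sketch below assumes a Plackett-Luce distribution with skill parameters $\exp(U(\boldsymbol{x},Q))$ (this parametrization and all names are assumptions for illustration) and uses the deterministic top- $k$ procedure for $g$. It estimates the choice probabilities by sampling instead of summing over all rankings.

```python
import math
import random

def sample_plackett_luce(utilities, rng):
    """Sample a ranking (best first) with skills exp(utility), stage by stage."""
    remaining = dict(utilities)
    ranking = []
    while remaining:
        objects = list(remaining)
        weights = [math.exp(remaining[x]) for x in objects]
        picked = rng.choices(objects, weights=weights, k=1)[0]
        ranking.append(picked)
        del remaining[picked]
    return ranking

def top_k_choice(ranking, k):
    """Deterministic g: keep the first k objects of the ranking."""
    return frozenset(ranking[:k])

rng = random.Random(0)
utilities = {"a": 2.0, "b": 0.5, "c": -1.0}  # hypothetical scores U(x, Q)
counts = {}
for _ in range(5000):
    choice = top_k_choice(sample_plackett_luce(utilities, rng), k=2)
    counts[choice] = counts.get(choice, 0) + 1
```

Dividing the entries of `counts` by the number of samples gives Monte Carlo estimates of $p(C\mid Q)$ for the top-2 sets.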

In this paper, we are mainly interested in tackling the problem of learning context-dependent choice functions from training data. The performance of a particular hypothesis, i. e., a choice function $c\colon\mathcal{Q}\longrightarrow\mathcal{C}$ , is measured by an appropriate loss function (see Section 4). In Section 6.2 we go into more detail on how to derive suitable loss functions from (5) and (6). After having introduced suitable models for utility-based choices, we now turn to the problem of representing context-dependent choice functions.

4 Learning Context-Dependent Choice Functions

Our main interest in this paper is to tackle choice from a machine learning perspective. More specifically, we seek to induce a predictive choice function $c\colon\mathcal{Q}\longrightarrow\mathcal{C}$ from training data $\mathcal{D}=\{(Q_{i},C_{i})\}_{i=1}^{N}\subset\mathcal{Q}\times\mathcal{C}$ in the form of exemplary tasks $Q_{i}$ together with observed choices $C_{i}\in 2^{Q_{i}}$ . The performance of such a function is measured in terms of its expected loss (risk)

[TABLE]

where $L:\,\mathcal{C}\times\mathcal{C}\longrightarrow\mathbb{R}$ is a loss function (cf. Section 6.2 for an overview of the loss functions we consider), and $\operatorname{\mathit{p}}$ the probability measure associated with the distribution (1), i. e., the underlying data-generating process modelling the probability of observing tasks $Q$ together with choices $C$ . The Bayes predictor $c^{*}$ assigns each task $Q$ the respective loss minimizer

[TABLE]

Since $\operatorname{\mathit{p}}(Q,C)$ is usually unknown, one opts to minimize the empirical risk

[TABLE]

on the given data $\mathcal{D}$ instead.
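The empirical risk is simply the average loss over the observed pairs $(Q_{i},C_{i})$ . The sketch below uses a 0/1 loss on choice sets as a stand-in (the losses actually considered are discussed in Section 6.2), and the predictor, the data and the argument order of the loss are hypothetical.

```python
def zero_one_loss(predicted, observed):
    """1 if the predicted choice set differs from the observed one, else 0."""
    return 0.0 if predicted == observed else 1.0

def empirical_risk(choice_function, data, loss):
    """Average loss of the choice function over the pairs (Q_i, C_i)."""
    return sum(loss(choice_function(q), c) for q, c in data) / len(data)

# Hypothetical predictor: always choose the largest number in the task.
def choose_max(task):
    return frozenset({max(task)})

data = [
    (frozenset({1, 2, 3}), frozenset({3})),
    (frozenset({4, 5}), frozenset({4})),
    (frozenset({0, 7}), frozenset({7})),
    (frozenset({2, 9}), frozenset({9})),
]
risk = empirical_risk(choose_max, data, zero_one_loss)
```

Here the predictor disagrees with one of the four observed choices, so the empirical risk is one quarter.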

Assuming the data to be generated according to one of (3)–(7) (known to the learner) and by means of an (unknown) latent utility function (2), this loss minimization problem essentially comes down to learning the generalized utility function (2). This function, while allowing one to model context-dependence, causes several practical problems, mainly because its second argument, $Q$ , is a set of variable size.

Many machine learning models such as neural networks or support vector machines require data to be given in the form of a feature vector $\boldsymbol{x}\in\mathbb{R}^{m}$ . Hence, in order to apply such a model for learning a utility function $U\colon\{(\boldsymbol{x},Q):\boldsymbol{x}\in Q\in\mathcal{Q}\}\longrightarrow\mathbb{R}$ , we have to fix an injective feature transformation $\Psi\colon\mathcal{Q}\longrightarrow\mathbb{R}^{m}$ .

We choose to represent $Q=\{\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{k}\}\in\mathcal{Q}\subset\mathbb{R}^{d}$ by the vector $(\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{k})\in\mathbb{R}^{kd}$ . Of course, this only defines a valid transformation $\Psi$ in case $|Q|$ is the same for each $Q\in\mathcal{Q}$ . Assuming this to be the case, we may consider a utility function $U\colon\{(\boldsymbol{x},Q):\boldsymbol{x}\in Q\in\mathcal{Q}\}\longrightarrow\mathbb{R}$ as a function $\mathbb{R}^{(k+1)d}\longrightarrow\mathbb{R}$ . Noticing that ${Q=\{\boldsymbol{x}_{\sigma(i)}:i\in[k]\}}$ holds for any bijection $\sigma\colon[k]\longrightarrow[k]$ , this function should necessarily be permutation-invariant or symmetric in the sense that

[TABLE]

for each permutation $\sigma$ on $[k]$ [106].

The utility choice models proposed below will enforce this property and are also capable of dealing with tasks of different sizes. More specifically, we present two general decompositions, which are able to approximate a generalized latent utility function (2). Section 4.1 describes FETA, which decomposes (2) into first- and second-order (or, more generally, higher order) utility functions and aggregates the corresponding scores into an overall utility score. The FATE approach (Section 4.2), on the other hand, first computes an embedding of the complete object context $Q$ in a space of fixed dimensionality, and evaluates the utility of each object in that space. The former could be advantageous for datasets in which the context of a choice task can be expressed through local interactions, while the latter is useful if the set of objects as a whole can be summarized by suitable global properties (e. g., choosing the element of a set that is closest to the centroid of all elements in this set).

4.1 First Evaluate Then Aggregate

Recall that the overall objective is to model the context-dependent utility function (2), i. e., the utility of each object should not only depend on object attributes, but also on the choice task $Q$ . One way of handling the problem of rating objects in contexts of variable size is to decompose a context into sub-contexts of a fixed size $k$ [85, 101]. More specifically, the idea is to learn sub-utility functions $U_{0},\dots,U_{K}$ of the form $U_{0}:\mathcal{X}\longrightarrow\mathbb{R}$ and

[TABLE]

for $1\leq k\leq K\leq|Q|$ , and represent the original function (2) as an aggregation

[TABLE]

where $\overline{U}_{k}(\boldsymbol{x},Q)$ is the average over the values $U_{k}(\boldsymbol{x},Q^{\prime})$ for subsets $Q^{\prime}$ of $Q\setminus\{\boldsymbol{x}\}$ consisting of $k$ distinct elements, i. e., formally

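$$\overline{U}_{k}(\boldsymbol{x},Q)\coloneqq\binom{|Q|-1}{k}^{-1}\sum_{\substack{Q^{\prime}\subseteq Q\setminus\{\boldsymbol{x}\}\\ |Q^{\prime}|=k}}U_{k}(\boldsymbol{x},Q^{\prime})$$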

Note that the sum is taken w.r.t. all $k$ -sized subsets $Q^{\prime}$ of $Q\setminus\{\boldsymbol{x}\}$ , potentially including some in $2^{\mathcal{X}}\setminus\mathcal{Q}$ . Here, $U_{k}(\boldsymbol{x},Q)$ may be thought of as a measure of the extent to which an item $\boldsymbol{x}$ is preferred to the elements of $Q$ , and $\overline{U}_{k}(\boldsymbol{x},Q)$ as an indicator of how much $\boldsymbol{x}$ is on average preferred to $k$ distinct elements from $Q\setminus\{\boldsymbol{x}\}$ . We refer to this approach as First Evaluate Then Aggregate (FETA), because an alternative is first evaluated in each sub-context, and these evaluations are then aggregated. Accordingly, we call $U$ defined in (10) the FETA utility function with sub-utility functions $U_{0},\dots,U_{K}$ and denote it by $U_{\text{FETA}}^{U_{0},\dots,U_{K}}$ .
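The aggregation described above can be sketched as follows, assuming that the overall score is the sum over $k$ of the averaged scores $\overline{U}_{k}$ , with the order-zero term being $U_{0}(\boldsymbol{x})$ itself. The sub-utility functions in the example are invented for illustration, and orders for which no $k$ -subset exists are skipped (an assumption made here to keep the sketch total).

```python
from itertools import combinations

def feta_utility(x, task, sub_utilities):
    """Sum over k of the average of U_k(x, Q') over all k-subsets Q' of Q without x.

    sub_utilities[0] takes x alone; sub_utilities[k] takes (x, frozenset of size k).
    """
    others = [y for y in task if y != x]
    total = sub_utilities[0](x)
    for k in range(1, len(sub_utilities)):
        values = [sub_utilities[k](x, frozenset(sub))
                  for sub in combinations(others, k)]
        if values:  # skip orders larger than the number of other objects
            total += sum(values) / len(values)
    return total

# Hypothetical sub-utilities on numbers.
u0 = lambda x: float(x)
u1 = lambda x, ctx: float(x > max(ctx))  # 1 if x beats the single other object
u2 = lambda x, ctx: float(x > sum(ctx))  # 1 if x beats a pair combined

task = frozenset({1, 2, 4})
scores = {x: feta_utility(x, task, [u0, u1, u2]) for x in task}
```

Because each order is averaged before being added, the pairwise and triple-wise terms each contribute at most 1 here, regardless of how many sub-contexts exist.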

[7] propose a related expansion in the context of market share modelling. [101] call it an instantiation of the universal logit model, since it can be seen as a generalization of the multinomial logit model (5), when conditioning on the task $Q$ .

Roughly speaking, the motivation behind the above decomposition is that dependencies and interaction effects between objects should only occur up to a certain order $K+1$ , or at least can be limited to this order without losing too much information. To see what we mean by “order” in this context, observe that the first order model ( $K=0$ ) reduces to $U_{0}(\boldsymbol{x})$ and thus only models the inherent utility of each object. A second order model ( $K=1$ ) then introduces pairwise terms. This is an assumption that is commonly made in the literature on aggregation functions [44]. The reason why the utilities are averaged for a fixed $k$ , but summed across different $k$ , is to give each order equal weight. This prevents the utility from being dominated by higher-order interactions. Furthermore, it allows the sub-utility functions to output scores in roughly the same scale, which is advantageous when the model is applied to choice tasks $Q$ of varying size.

Given the models of context-dependent choices as outlined above, the learning problem essentially comes down to learning the utility function (10) of order $K+1$ . From this function, one can then derive the utility function (2), which in turn allows for deriving predictions of choices via the choice functions discussed before.

In this paper, we realize (10) for the special case $K=1$ , which can be seen as a second-order approximation of a context-dependent utility function. Thus, we propose the representation of a choice function $c$ based on a latent sub-utility function $U_{0}:\mathcal{X}\longrightarrow\mathbb{R}$ and a pairwise function $U_{1}:D_{1}\longrightarrow\mathbb{R}$ . In this way, the FETA utility function with sub-utility functions $U_{0},U_{1}$ may be written as

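$$U(\boldsymbol{x},Q)=U_{0}(\boldsymbol{x})+\frac{1}{|Q|-1}\sum_{\boldsymbol{y}\in Q\setminus\{\boldsymbol{x}\}}U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})\tag{11}$$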

The value $U_{0}(\boldsymbol{x})$ can be seen as a kind of inherent, context-independent utility of $\boldsymbol{x}$ , whereas the scores $U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})$ , $\boldsymbol{y}\in Q\setminus\{\boldsymbol{x}\}$ , serve as “corrections” of this utility in the context of the task $Q$ .
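The reading of $U_{1}$ as a context-dependent correction can be illustrated with a minimal sketch of the second-order case. The sub-utilities are hypothetical: every object has the same inherent utility, and each rival subtracts a penalty that grows with its similarity to $\boldsymbol{x}$ .

```python
def feta_pairwise_utility(x, task, u0, u1):
    """U_0(x) plus the mean of U_1(x, {y}) over the other objects y in the task."""
    others = [y for y in task if y != x]
    return u0(x) + sum(u1(x, y) for y in others) / len(others)

# Hypothetical sub-utilities: constant inherent value, penalty for similar rivals.
u0 = lambda x: 1.0
u1 = lambda x, y: -1.0 / (1.0 + abs(x - y))

with_far_rival = feta_pairwise_utility(0.0, {0.0, 9.0}, u0, u1)
crowded = feta_pairwise_utility(0.0, {0.0, 1.0, 9.0}, u0, u1)
```

The same object receives a lower score once a close competitor enters the task, although its inherent utility $U_{0}$ is unchanged.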

[101] propose a similar approximation, but instead of averaging the task context, the authors simply sum up all utilities and impose sum-to-zero constraints to guarantee identifiability.

As for the FETA model $U_{\mathrm{FETA}}^{(U_{0},U_{1})}$ , we will now see that it is identifiable up to the choice of $U_{0}$ .

Proposition 4.2.

Suppose $|\mathcal{X}|\geq 4$ and $\mathcal{Q}$ to be such that for any distinct $\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\in\mathcal{X}$ there is some $Q\in\mathcal{Q}$ with $\{\boldsymbol{x},\boldsymbol{y}\}\subseteq Q\not\ni\boldsymbol{z}$ . Let $U_{0},\tilde{U}_{0}\colon\mathcal{X}\longrightarrow\mathbb{R}$ and $U_{1},\tilde{U}_{1}\colon D_{1}\longrightarrow\mathbb{R}$ be arbitrary. Then, we have $U_{\mathrm{FETA}}^{(U_{0},U_{1})}=U_{\mathrm{FETA}}^{(\tilde{U}_{0},\tilde{U}_{1})}$ if and only if

[TABLE]

Proof.

$\Leftarrow$ is clear. For proving the remaining implication $\Rightarrow$ , suppose that $U_{\mathrm{FETA}}^{(U_{0},U_{1})}=U_{\mathrm{FETA}}^{(\tilde{U}_{0},\tilde{U}_{1})}$ .

Claim 4.2.1.

For any distinct $\boldsymbol{x},\boldsymbol{y},\boldsymbol{z}\in\mathcal{X}$ we have

[TABLE]

Proof.

For arbitrary $Q,Q^{\prime}\subseteq\mathcal{X}$ with $|Q|=|Q^{\prime}|$ , $\{\boldsymbol{x},\boldsymbol{y}\}\subseteq Q\not\ni\boldsymbol{z}$ and $\{\boldsymbol{x},\boldsymbol{z}\}\subseteq Q^{\prime}\not\ni\boldsymbol{y}$ we have

[TABLE]

Since this holds for arbitrary $(U_{0},U_{1})$ (and thus also for $(\tilde{U}_{0},\tilde{U}_{1})$ ), Claim 4.2.1 follows. $\blacksquare$

Now, let $\boldsymbol{x}_{0}\in\mathcal{X}$ be fixed for the moment and define $b\colon\mathcal{X}\longrightarrow\mathbb{R}$ via

[TABLE]

According to Claim 4.2.1 we have for any distinct $\boldsymbol{x},\boldsymbol{y}\in\mathcal{X}\setminus\{\boldsymbol{x}_{0}\}$ the identity

[TABLE]

Moreover, the definition of $b$ assures that $\tilde{U}_{1}(\boldsymbol{x},\{\boldsymbol{x}_{0}\})=U_{1}(\boldsymbol{x},\{\boldsymbol{x}_{0}\})-b(\boldsymbol{x})$ holds for any $\boldsymbol{x}\not=\boldsymbol{x}_{0}$ , i. e., $b$ already fulfills

[TABLE]

For $\boldsymbol{x}\in\mathcal{X}\setminus\{\boldsymbol{x}_{0}\}$ we may choose a query set $Q\subseteq\mathcal{X}\setminus\{\boldsymbol{x}_{0}\}$ and then (12) assures us

[TABLE]

Since $\tilde{U}_{0}(\boldsymbol{x}_{0})=U_{0}(\boldsymbol{x}_{0})+b(\boldsymbol{x}_{0})$ holds by definition of $b$ , we thus have shown

[TABLE]

With regard to (12) it remains to show

[TABLE]

For this, note that the same argumentation as before with $\boldsymbol{x}_{0}$ replaced by some arbitrary $\boldsymbol{x}_{1}\in\mathcal{X}\setminus\{\boldsymbol{x}_{0}\}$ shows us that $b$ also fulfills (12) with $\boldsymbol{x}_{0}$ replaced by $\boldsymbol{x}_{1}$ . In particular, (14) holds. Combining (12), (13) and (14) completes the proof.

$\blacksquare$

Corollary 4.3.

Suppose $\mathcal{X}$ and $\mathcal{Q}$ are as in Proposition 4.2 and let $U_{0}\colon\mathcal{X}\longrightarrow\mathbb{R}$ be fixed. Then, the mapping $U_{1}\mapsto U_{\mathrm{FETA}}^{(U_{0},U_{1})}$ is injective.

Another interesting theoretical question concerns the expressiveness of the FETA decomposition: Which predictors $c\colon\mathcal{Q}\longrightarrow\mathcal{C}$ can be represented by FETA? The following result shows that the decomposition into pairwise utilities (11) is indeed a restriction, in the sense that it does not allow for representing the entire class of predictors in case $|\mathcal{X}|\geq 7$ .

Proposition 4.4.

If $|\mathcal{X}|\geq 7$ , not every singleton choice function on $\mathcal{X}$ can be expressed via the second order FETA model. More precisely: For distinct $\boldsymbol{a},\boldsymbol{b},\boldsymbol{c},\boldsymbol{x},\boldsymbol{y},\boldsymbol{x^{\prime}},\boldsymbol{y^{\prime}}\in\mathcal{X}$ there do not exist sub-utility functions $U_{0}\colon\mathcal{X}\longrightarrow\mathbb{R}$ , $U_{1}\colon D_{1}\longrightarrow\mathbb{R}$ and $t\in\mathbb{R}$ such that the choice function $c\colon\mathcal{Q}\longrightarrow\mathcal{C}$ defined either via $c(\cdot)\coloneqq C_{\mathrm{singleton}}(U_{\mathrm{FETA}}^{U_{0},U_{1}},\cdot)$ or via $c(\cdot)\coloneqq C_{\mathrm{subset}}^{t}(U_{\mathrm{FETA}}^{U_{0},U_{1}},\cdot)$ fulfills

[TABLE]

Proof.

We prove the statement indirectly. To this end, fix distinct $\boldsymbol{a}$ , $\boldsymbol{b}$ , $\boldsymbol{c}$ , $\boldsymbol{x}$ , $\boldsymbol{y}$ , $\boldsymbol{x^{\prime}}$ , $\boldsymbol{y^{\prime}}\in\mathcal{X}$ and assume there were some $U_{0},U_{1}$ and $t\in\mathbb{R}$ such that $c$ defined either via $c(\cdot)\coloneqq C_{\mathrm{singleton}}(U_{\mathrm{FETA}}^{U_{0},U_{1}},\cdot)$ or via $c(\cdot)\coloneqq C_{\mathrm{subset}}^{t}(U_{\mathrm{FETA}}^{U_{0},U_{1}},\cdot)$ fulfills both (15) and (16). With the convenient abbreviations $u_{\boldsymbol{r}}\coloneqq U_{0}(\boldsymbol{r})$ and $u_{\boldsymbol{r},\boldsymbol{s}}\coloneqq U_{1}(\boldsymbol{r},\{\boldsymbol{s}\})$ , the following constraints for (11) immediately follow from (15):

[TABLE]

Summing up the first two inequalities and then applying the third one yields

[TABLE]

from which we obtain via subtracting common terms

[TABLE]

Exactly the same argumentation (with the roles of $\boldsymbol{a}$ and $\boldsymbol{b}$ interchanged and $\boldsymbol{x}$ resp. $\boldsymbol{y}$ replaced by $\boldsymbol{x^{\prime}}$ resp. $\boldsymbol{y^{\prime}}$ ) lets us infer from (16)

[TABLE]

which contradicts (17). This completes the proof. $\blacksquare$

Note that limited expressivity should not necessarily be seen as a negative property. In particular, from a machine learning perspective, excessive expressivity (or capacity of the underlying hypothesis space) is connected with the practical problem of poor generalization due to overfitting, i. e., being overly expressive may prevent the learner from identifying the right model. In any case, we expect FETA to work well for all choice functions that (approximately) decompose into a pairwise relation between objects. Naturally, this leads to the question of whether it is possible to incorporate more of the set-based context without ultimately increasing computational complexity. This question motivated our next decomposition.

4.2 First Aggregate Then Evaluate

To deal with the problem of task contexts of variable size, our previous approach was to decompose the context into sub-contexts of a fixed size, evaluate an object $\boldsymbol{x}$ in each of the sub-contexts, and then aggregate these evaluations into an overall assessment. An alternative to this FETA strategy, and in a sense the opposite approach, consists of first aggregating the task into a representation of fixed size, and then evaluating the object $\boldsymbol{x}$ in the presence of this task representative.

More specifically, the FATE approach requires a mapping $\phi$ from $\mathcal{X}$ to some $m$ -dimensional embedding space $\mathcal{Z}\subseteq\mathbb{R}^{m}$ as well as a context-dependent sub-utility function $U^{\prime}\colon\mathcal{X}\times\mathcal{Z}\longrightarrow\mathbb{R}$ . To evaluate an object $\boldsymbol{x}$ in a choice task $Q\in\mathcal{Q}$ , the FATE strategy first computes $\frac{1}{|Q|}\sum_{\boldsymbol{y}\in Q}\phi(\boldsymbol{y})$ as representative for the task and then evaluates it via $U^{\prime}$ as

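$$U(\boldsymbol{x},Q)=U^{\prime}\Bigl(\boldsymbol{x},\frac{1}{|Q|}\sum_{\boldsymbol{y}\in Q}\phi(\boldsymbol{y})\Bigr)\tag{18}$$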

We call this $U$ the FATE utility function with sub-utility function $U^{\prime}$ and transformation $\phi$ and denote it by $U_{\text{FATE}}^{U^{\prime},\phi}$ .
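A minimal sketch of the FATE evaluation: embed every object, average the embeddings into a task representative, and score each object against it. The identity embedding and the distance-based sub-utility are hypothetical choices that realize the centroid example mentioned in the text.

```python
def fate_utility(x, task, phi, u_prime):
    """Evaluate x against the mean embedding of the whole task."""
    embedded = [phi(y) for y in task]
    dim = len(embedded[0])
    representative = tuple(sum(e[i] for e in embedded) / len(embedded)
                           for i in range(dim))
    return u_prime(x, representative)

# Hypothetical choices: identity embedding on 2-D points, and a sub-utility
# that rewards closeness to the representative (here the centroid).
phi = lambda y: y
def u_prime(x, z):
    return -((x[0] - z[0]) ** 2 + (x[1] - z[1]) ** 2)

task = [(0.0, 0.0), (4.0, 0.0), (2.0, 1.0), (2.0, 5.0)]
scores = {x: fate_utility(x, task, phi, u_prime) for x in task}
best = max(scores, key=scores.get)
same_when_shuffled = all(
    fate_utility(x, list(reversed(task)), phi, u_prime) == scores[x] for x in task
)
```

Since the representative is a mean, the score of an object does not depend on the order in which the task is presented, and the same code applies to tasks of any size.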

This approach is related to recent advances on dealing with set-valued inputs in neural networks [126, 90, 8], where a permutation-equivariant network directly maps from sets of objects to scores. [95] propose to learn set-dependent aggregation functions with an inductive bias towards principles from behavioral choice theory. They note that general models like Deep Sets [126], which try to approximate set functions using a permutation-invariant neural network, are overly general, because they have a high violation capacity, i. e., the flexibility of the model to change its choices when objects are removed from the choice task. The FATE approach, on the other hand, first condenses the task context into a representative and only then scores each object. The resulting model has an inductive bias that favors functions for which the object utility depends on such a set-global reference object. This could be advantageous for datasets where the set of objects as a whole can be summarized by suitable global properties (e. g., choosing the element of a set that is closest to the centroid of all elements in the set), such that the task of scoring the objects with this context becomes easy. FETA, on the other hand, incorporates task information through local interactions.

Without further assumptions on $\phi$ and $U^{\prime}$ , this model is able to express any possible choice function $c$ on $\mathcal{Q}$ , as we show in the following. The proof of the upcoming result is similar to the proof of Theorem 2 by [126].

Proposition 4.6.

Suppose $\mathcal{X}$ to be countable and $\mathcal{Q}\subseteq\{Q\subseteq\mathcal{X}:|Q|<\infty\}$ . There exists a parametrization $\phi\colon\mathcal{X}\longrightarrow\mathbb{R}$ with the following property:

(i)

For any singleton choice function $c$ on $\mathcal{Q}$ , there is a utility function $U_{c}^{\prime}\colon\mathcal{X}\times\mathbb{R}\longrightarrow\mathbb{R}$ such that $C_{\mathrm{singleton}}(U_{\mathrm{FATE}}^{U^{\prime}_{c},\phi},Q)=c(Q)$ holds for any $Q\in\mathcal{Q}$ .

(ii)

For any subset choice function $c$ on $\mathcal{Q}$ there exists a utility function $U^{\prime}_{c}\colon\mathcal{X}\times\mathbb{R}\longrightarrow\mathbb{R}$ with $C_{\mathrm{subset}}^{1/2}(U_{\mathrm{FATE}}^{U^{\prime}_{c},\phi},Q)=c(Q)$ for any $Q\in\mathcal{Q}$ .

Proof.

Since $\mathcal{X}$ is countable, there exists an injective function $\delta\colon\mathcal{X}\longrightarrow\mathbb{N}$ . For $\boldsymbol{x}\in\mathcal{X}$ define

[TABLE]

wherein $p_{i}\in\mathbb{N}$ denotes the $i$ -th prime number for any $i\in\mathbb{N}$ . Before proving (i) and (ii), we show that the mapping

[TABLE]

is injective. For this, let $Q,Q^{\prime}\in\mathcal{Q}$ with $\Phi(Q)=\Phi(Q^{\prime})$ . Then,

[TABLE]

holds for the integers $a\coloneqq\prod\nolimits_{\boldsymbol{x}\in Q^{\prime}}p_{\delta(\boldsymbol{x})}$ and $b\coloneqq\prod\nolimits_{\boldsymbol{x}\in Q}p_{\delta(\boldsymbol{x})}$ , i. e., $a^{|Q|}=b^{|Q^{\prime}|}$ . As $a$ and $b$ are both products of distinct primes, the uniqueness of the prime factorization lets us infer $a=b$ and thus also $Q=Q^{\prime}$ .

We proceed with proving (i) and (ii) simultaneously. For this, suppose any choice function $c$ on $\mathcal{Q}$ to be fixed. Since $\Phi$ from above is injective, there exists a mapping $\Psi\colon\mathbb{R}\longrightarrow\mathcal{Q}$ such that $\Psi(\frac{1}{|Q|}\sum_{\boldsymbol{x}\in Q}\phi(\boldsymbol{x}))=Q$ holds for any $Q\in\mathcal{Q}$ . Note that $\Psi\mathord{\upharpoonright}_{\Phi(\mathcal{Q})}$ is the inverse function of $\Phi$ . Thus, the claim follows with the choice

[TABLE]

$\blacksquare$
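The injectivity argument in the proof can be checked by brute force on a small ground set. The sketch assumes that $\phi$ maps an object to the logarithm of its prime $p_{\delta(\boldsymbol{x})}$ , which is what the relation $a^{|Q|}=b^{|Q^{\prime}|}$ in the proof suggests; the displayed definition is not reproduced here, so treat this as an inferred reading. To avoid floating-point comparisons, two mean embeddings are compared through the equivalent integer identity.

```python
from itertools import combinations
from math import prod

PRIMES = [2, 3, 5, 7, 11]  # p_1, ..., p_5 for a hypothetical ground set of size 5

def signature(task):
    """Exact stand-in for the mean of log-primes: the pair (product, size)."""
    return prod(PRIMES[i] for i in task), len(task)

def same_embedding(task_a, task_b):
    """mean(log p) agrees iff prod_a ** |b| == prod_b ** |a| (integer arithmetic)."""
    (pa, na), (pb, nb) = signature(task_a), signature(task_b)
    return pa ** nb == pb ** na

# All non-empty subsets of the ground set, encoded by object indices 0..4.
tasks = [frozenset(c) for r in range(1, 6) for c in combinations(range(5), r)]
collisions = [(s, t) for s, t in combinations(tasks, 2) if same_embedding(s, t)]
```

No two distinct tasks share an embedding, so a sub-utility function that looks up the task from the scalar representative can reproduce any choice function, at the price of relying on an encoding that no smooth learner would be expected to find.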

Although this expressivity is desirable in general, it comes at a cost. The FATE model $U_{\mathrm{FATE}}^{(\phi,U^{\prime})}$ as such is not identifiable: For example, suppose $U^{\prime}\colon\mathcal{X}\times\mathcal{Z}\longrightarrow\mathbb{R}$ is of the form $U^{\prime}(\boldsymbol{x},\boldsymbol{z})\coloneqq f(\boldsymbol{x})+\lVert\boldsymbol{z}\rVert_{2}$ for some function $f\colon\mathcal{X}\longrightarrow\mathbb{R}$ , where $\lVert\cdot\rVert_{2}$ denotes the standard Euclidean norm in $\mathbb{R}^{m}\supseteq\mathcal{Z}$ . For arbitrary $\phi_{1}\colon\mathcal{X}\longrightarrow\mathcal{Z}$ , we obtain with $\phi_{2}\coloneqq-\phi_{1}$ that

[TABLE]

for any $Q\in\mathcal{Q},\boldsymbol{x}\in Q\subseteq\mathcal{X}$ , i. e., $U_{\mathrm{FATE}}^{(\phi_{1},U^{\prime})}=U_{\mathrm{FATE}}^{(\phi_{2},U^{\prime})}$ holds.
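The counterexample is easy to verify numerically. In the sketch below, the function $f$ and the embedding $\phi_{1}$ are hypothetical; only the structure $U^{\prime}(\boldsymbol{x},\boldsymbol{z})=f(\boldsymbol{x})+\lVert\boldsymbol{z}\rVert_{2}$ and $\phi_{2}=-\phi_{1}$ is taken from the text.

```python
import math

def fate_utility(x, task, phi, u_prime):
    """Evaluate x against the mean embedding of the whole task."""
    embedded = [phi(y) for y in task]
    dim = len(embedded[0])
    mean = tuple(sum(e[i] for e in embedded) / len(embedded) for i in range(dim))
    return u_prime(x, mean)

f = lambda x: x[0] - x[1]                      # hypothetical f
u_prime = lambda x, z: f(x) + math.sqrt(sum(c * c for c in z))
phi_1 = lambda y: (2.0 * y[0], y[0] + y[1])    # hypothetical embedding
phi_2 = lambda y: tuple(-c for c in phi_1(y))  # its negation

task = [(1.0, 2.0), (3.0, -1.0), (0.5, 0.5)]
identical = all(
    fate_utility(x, task, phi_1, u_prime) == fate_utility(x, task, phi_2, u_prime)
    for x in task
)
```

Two different embeddings thus induce exactly the same utilities, so $\phi$ cannot be recovered from observed choices alone.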

4.3 Linear Sub-Utility Functions

A related question concerns the expressivity of the FATE and FETA approaches when the underlying sub-utility functions and transformations are linear functions. In case $\phi$ and $U^{\prime}$ are chosen as linear functions in the sense that $\phi(\boldsymbol{x})=\boldsymbol{A}\boldsymbol{x}$ and $U^{\prime}(\boldsymbol{x},\boldsymbol{z})=\boldsymbol{c}^{t}\boldsymbol{x}+\boldsymbol{d}^{t}\boldsymbol{z}$ for any $(\boldsymbol{x},\boldsymbol{z})\in\mathcal{X}\times\mathcal{Z}$ and some $\boldsymbol{A}\in\mathbb{R}^{m\times d}$ , $\boldsymbol{c}\in\mathbb{R}^{d}$ and $\boldsymbol{d}\in\mathbb{R}^{m}$ , (18) takes the form

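$$U(\boldsymbol{x},Q)=\boldsymbol{c}^{t}\boldsymbol{x}+\frac{1}{|Q|}\sum_{\boldsymbol{y}\in Q}\boldsymbol{d}^{t}\boldsymbol{A}\boldsymbol{y}$$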

As the second summand therein does not depend on $\boldsymbol{x}$ , for any $Q\in\mathcal{Q}$ , the singleton choice $C_{\mathrm{singleton}}(U_{\text{FATE}}^{U^{\prime},\phi},Q)$ is the same as that corresponding to the linear utility function $\boldsymbol{x}\mapsto\boldsymbol{c}^{t}\boldsymbol{x}$ and thus independent of the context $Q$ . Consequently, at least one of $U^{\prime}$ and $\phi$ has to be non-linear in order to model context-dependent choices.
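The following sketch checks this observation numerically: with linear $\phi$ and $U^{\prime}$ , the context only adds a constant to all utilities within a task, so the maximizer is determined by $\boldsymbol{c}^{t}\boldsymbol{x}$ alone. The weights and tasks are hypothetical, and $\boldsymbol{A}$ is stored as a list of $m$ rows of length $d$ .

```python
def linear_fate_utility(x, task, A, c, d):
    """c^T x + d^T (A * mean of the task), with A given as a list of rows."""
    n = len(task)
    mean = [sum(y[i] for y in task) / n for i in range(len(x))]
    z = [sum(a * m for a, m in zip(row, mean)) for row in A]
    return sum(ci * xi for ci, xi in zip(c, x)) + sum(di * zi for di, zi in zip(d, z))

def singleton_choice(task, A, c, d):
    """Object of maximal linear FATE utility in the task."""
    return max(task, key=lambda x: linear_fate_utility(x, task, A, c, d))

# Hypothetical weights with d = 2 features and an m = 3 dimensional embedding.
A = [[1.0, 0.0], [0.5, -2.0], [3.0, 1.0]]
c = [1.0, -1.0]
d = [0.3, -0.7, 2.0]

small = [(0.0, 1.0), (2.0, 0.0)]
large = small + [(-5.0, 4.0), (1.0, 3.0)]
choice_small = singleton_choice(small, A, c, d)
choice_large = singleton_choice(large, A, c, d)
offset_small = linear_fate_utility(small[0], small, A, c, d) - (c[0] * 0.0 + c[1] * 1.0)
offset_large = linear_fate_utility(small[0], large, A, c, d) - (c[0] * 0.0 + c[1] * 1.0)
```

The context offsets differ between the two tasks, yet the chosen object is the same, because the offset is shared by all objects of a task.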

In contrast to this, for the case of FETA, linearity of the sub-utility functions does not imply context-independence of the model: If $U_{0}$ and $U_{1}$ are linear in the sense that $U_{0}(\boldsymbol{x})=\boldsymbol{b}^{t}\boldsymbol{x}$ and $U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})=\boldsymbol{c}^{t}\boldsymbol{x}+\boldsymbol{d}^{t}\boldsymbol{y}$ for any distinct $\boldsymbol{x},\boldsymbol{y}\in\mathcal{X}$ and some weight vectors $\boldsymbol{b},\boldsymbol{c},\boldsymbol{d}\in\mathbb{R}^{d}$ , the FETA utility function with sub-utility functions $U_{0},U_{1}$ is given as

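$$U(\boldsymbol{x},Q)=(\boldsymbol{b}+\boldsymbol{c})^{t}\boldsymbol{x}+\frac{1}{|Q|-1}\sum_{\boldsymbol{y}\in Q\setminus\{\boldsymbol{x}\}}\boldsymbol{d}^{t}\boldsymbol{y}$$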

for any $\boldsymbol{x}\in Q\in\mathcal{Q}$ . As the second summand therein depends not only on $Q$ but also on $\boldsymbol{x}$ , $U$ can in general not be represented as a linear function.
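A small numerical sketch shows that linear FETA can indeed reverse a preference when the task changes. Substituting the linear forms of $U_{0}$ and $U_{1}$ into (11) and pulling $\boldsymbol{x}$ out of the sum over $Q\setminus\{\boldsymbol{x}\}$ gives an effective weight $\boldsymbol{b}+\boldsymbol{c}-\boldsymbol{d}/(|Q|-1)$ on $\boldsymbol{x}$ plus a term shared by all objects; this rearrangement is ours and is stated here only to explain the example. The scalar weights below are hypothetical.

```python
def linear_feta_utility(x, task, b, c, d):
    """b*x + mean over y != x of (c*x + d*y), for scalar features."""
    others = [y for y in task if y != x]
    return b * x + sum(c * x + d * y for y in others) / len(others)

def singleton_choice(task, b, c, d):
    """Object of maximal linear FETA utility in the task."""
    return max(task, key=lambda x: linear_feta_utility(x, task, b, c, d))

b, c, d = 0.5, 0.5, 1.5  # hypothetical weights
choice_small = singleton_choice([0.0, 1.0], b, c, d)
choice_large = singleton_choice([0.0, 1.0, 0.2, 0.4], b, c, d)
```

With two objects the effective weight on $x$ is negative and the smaller object wins; with four objects it is positive and the larger object wins, although both tasks contain the objects 0 and 1.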

5 Implementation Using Neural Networks

Having defined the decomposition strategies FETA and FATE in the preceding section, we are still missing an algorithm that can actually learn the utility functions involved. In this section, we propose realizations of the FETA and FATE approaches in terms of the neural network architectures FETA-Net and FATE-Net, respectively. Our design goals for both neural networks are twofold. First, they should be end-to-end trainable using (stochastic) gradient descent, such that they can be used as part of a larger neural network architecture. To this end, we ensure that the outputs of the networks are differentiable almost everywhere with respect to the weights. Similarly, the loss functions employed in conjunction with a regularization term for the weights should also be differentiable almost everywhere and convex with respect to the utilities. Second, the architectures should be able to generalize beyond the task sizes encountered in the training data, since in practice it is unreasonable to expect all choice tasks to be of the same size.

5.1 FETA-Net Architecture

We will now describe our first neural network architecture FETA-Net and its training. Recall from Section 4.1 that we seek to predict utility scores $U(\boldsymbol{x}_{i},Q)$ of the form (11) for every object $\boldsymbol{x}_{i}\in Q$ . What we need to learn, therefore, is the functions $U_{0}$ and $U_{1}$ .

In FETA-Net, we do so by means of a deep neural network architecture (shown in Figure 3). The network is trained on a dataset $\mathcal{D}=\{(Q_{i},C_{i})\}_{i=1}^{N}$ , where each $Q_{i}$ is a choice task and $C_{i}\in 2^{Q_{i}}\setminus\{\emptyset\}$ the choice set observed for that task.

The main component is the neural network tasked with learning the pairwise utility function $U_{1}$ (depicted in blue). It receives the feature vectors of two objects $\boldsymbol{x}_{i}$ and $\boldsymbol{x}_{j}$ and outputs a score for $\boldsymbol{x}_{i}$ in the presence of object $\boldsymbol{x}_{j}$ . Building up the complete matrix $R=(r_{i,j})$ with a single output neuron would require evaluating the network on all ordered pairs of objects in $Q$ . This is why we choose to adopt the CmpNN approach by [93] for the pairwise scoring function, i. e., instead of one output neuron we utilize two, $U_{1}^{+}$ and $U_{1}^{-}$ . Weight sharing ensures that $U_{1}^{+}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=U_{1}^{-}(\boldsymbol{x}_{j},\boldsymbol{x}_{i})$ and $U_{1}^{-}(\boldsymbol{x}_{i},\boldsymbol{x}_{j})=U_{1}^{+}(\boldsymbol{x}_{j},\boldsymbol{x}_{i})$ hold. For the diagonal, we evaluate a separate network $U_{0}(\boldsymbol{x}_{i})$ , which learns a latent utility component for each object (corresponding to the case $k=0$ in (10)). With that, it suffices to iterate over all unordered pairs of objects once and to construct the matrix $R$ as follows:

[TABLE]

Then, each row of the relation $R$ is averaged to obtain a score $U(\boldsymbol{x}_{i},Q)=r_{i,i}+\frac{1}{|Q|-1}\sum_{1\leq j\neq i\leq|Q|}r_{i,j}$ for each object $\boldsymbol{x}_{i}\in Q$ . Therefore, the network $U_{1}$ is a mapping $\mathbb{R}^{d}\times\mathbb{R}^{d}\longrightarrow\mathbb{R}^{2}$ and $U_{0}$ a mapping $\mathbb{R}^{d}\longrightarrow\mathbb{R}$ , each of which can be instantiated by any neural network architecture suitable for the given objects. For our experiments later on, we shall use deep, densely connected networks. We treat the number of layers and units as hyperparameters and optimize them jointly with all the other hyperparameters.
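As a minimal sketch of this aggregation step, the following NumPy snippet builds $R$ for one task and averages its rows. The callables `u0` and `u1` are hypothetical stand-ins for the learned networks $U_{0}$ and $U_{1}$, and the layout of $R$ (the $U_{1}^{+}$ output above the diagonal, the $U_{1}^{-}$ output below it) is an assumption that follows from the weight-sharing identities stated above, not a transcription of the original matrix definition.

```python
import numpy as np

def feta_scores(X, u0, u1):
    """Aggregate zeroth- and first-order utilities into one score per object.

    X  : array of shape (n, d), the objects of one choice task Q (n >= 2).
    u0 : callable (d,) -> float, hypothetical stand-in for U_0.
    u1 : callable ((d,), (d,)) -> (float, float), hypothetical stand-in for
         the two outputs (U_1^+, U_1^-) of the pairwise network.
    """
    n = X.shape[0]
    R = np.zeros((n, n))
    for i in range(n):
        R[i, i] = u0(X[i])
        # Each unordered pair is evaluated only once.
        for j in range(i + 1, n):
            plus, minus = u1(X[i], X[j])
            R[i, j] = plus   # score of x_i in the presence of x_j
            R[j, i] = minus  # score of x_j in the presence of x_i
    off_diagonal = R.sum(axis=1) - np.diag(R)
    return np.diag(R) + off_diagonal / (n - 1)

# Toy stand-ins (assumed for illustration): no object-level utility and a
# negative Euclidean distance as pairwise utility.
X = np.array([[0.0], [1.0], [3.0]])
u0 = lambda x: 0.0

def u1(x, y):
    dist = float(np.linalg.norm(x - y))
    return -dist, -dist

scores = feta_scores(X, u0, u1)
```

With these toy stand-ins the middle point receives the highest score, since it has the smallest average distance to the other two points.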

The complete training algorithm for FETA-Net is shown in Algorithm 1, which is an instantiation of stochastic gradient descent. We will denote the weight vectors of the networks $U_{0}$ and $U_{1}$ by $\theta_{0}$ and $\theta_{1}$ , respectively. In the beginning, these weight vectors are suitably initialized in order to avoid exploding/vanishing gradients [42, 48]. In each epoch, the algorithm shuffles the given dataset and constructs mini-batches $\mathcal{B}_{1},\dots,\mathcal{B}_{T}$ with $\mathcal{B}_{i}\subset\mathcal{D}$ for all $i\in[T]$ . In lines 10 to 18, the pairwise relation is constructed as described above. The utilities $\boldsymbol{u}=(u_{1},\dots,u_{|Q|})$ for the objects inside the task $Q$ are computed in line 19 by aggregating the pairwise relation $r_{i,j}$ across the columns of the matrix, as described above. Finally, the loss is computed in line 20 and added to the cumulative loss for the batch. The weight vectors $\theta_{0}$ and $\theta_{1}$ are updated using backpropagation in lines 22–23.

It is easy to see that the training runtime complexity per epoch (including backpropagation) of FETA-Net is $\mathcal{O}\left(Ndq^{2}\right)$ , where $N$ denotes the number of instances, $d$ is the number of features per object, and $q\coloneqq\max_{(Q,C)\in\mathcal{D}}|Q|$ is an upper bound on the number of objects in each choice task. For a new task $Q$ , the prediction time is in $\mathcal{O}\left(d|Q|^{2}\right)$ .

5.2 FATE-Net Architecture

The second architecture we propose is called FATE-Net, and the structure for predicting the score for one object is depicted in Figure 4. Inputs are the $n$ objects of the task $Q=\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$ (shown in green). Each object is independently passed through a deep, densely connected embedding layer (shown in blue). The embedding layer approximates the function $\phi$ in (18) and is a map $\mathbb{R}^{d}\longrightarrow\mathbb{R}^{d^{\prime}}$ . Note that we employ weight sharing, i. e., the same embedding is used for each object. Then, the representative $\mu_{Q}$ for the task $Q$ is computed by averaging the representations of each object. To calculate the score $U(\boldsymbol{x}_{i},\mu_{Q})$ for an object $\boldsymbol{x}_{i}$ , the feature vector is concatenated with $\mu_{Q}$ to form the input to the final joint neural network layers (here depicted in orange). Again, weight sharing is used to learn only one scoring network. For both neural networks, we treat the number of layers, units and embedding dimensions as hyperparameters, which are to be optimized.

The detailed training algorithm is shown in Algorithm 2. As mentioned before for FETA-Net, it is an instantiation of stochastic gradient descent. We will denote the weight vectors of the networks $U$ and $\phi$ by $\theta_{U}$ and $\theta_{\phi}$ , respectively. The initialization of the weight vectors and the construction of the mini-batches (lines 10–14) are again the same as for FETA-Net. In line 18, the representative object $\mu_{Q}$ is constructed by first mapping each object to the embedding space using $\phi_{\theta_{\phi}}$ , and then computing the centroid of the embedded points. The embedding network can be any network that receives an object and returns a $d^{\prime}$ -dimensional real-valued vector, and should be adapted to the data at hand. The utility scores $\boldsymbol{u}$ are then computed by evaluating each object $\boldsymbol{x}\in Q$ in conjunction with the representative point $\mu_{Q}$ (see line 19). The cumulative loss for the mini-batch is updated in line 20. The weight vectors $\theta_{\phi}$ and $\theta_{U}$ are updated by calculating the gradient of the loss using backpropagation and scaling it by an appropriate learning rate (lines 22–21).

The training runtime complexity per epoch of FATE-Net (including backpropagation) is $\mathcal{O}\left(Ndq\right)$ , where $N$ denotes the number of choice tasks, $d$ is the number of features per object, and $q$ is an upper bound on the number of objects in each task. For a new choice task $Q$ , the prediction can be done in only $\mathcal{O}(d|Q|)$ time (i. e., linear in the number of objects). This is due to the fact that $\mu_{Q}$ only needs to be computed once.
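The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative toy and not the architecture used in the experiments: the layer sizes, the single-layer embedding with `tanh`, the one-hidden-layer scoring network, and the random (untrained) weights are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_emb, hidden = 4, 8, 16  # assumed sizes, chosen only for illustration

# Randomly initialized weights; in FATE-Net these would be learned.
W_phi = rng.normal(size=(d, d_emb))
W1 = rng.normal(size=(d + d_emb, hidden))
w2 = rng.normal(size=hidden)

def embed(X):
    """Shared embedding phi applied to every object: (n, d) -> (n, d_emb)."""
    return np.tanh(X @ W_phi)

def fate_scores(X):
    """Score every object of one task against the task representative mu_Q."""
    mu_Q = embed(X).mean(axis=0)                          # computed once per task
    joint = np.hstack([X, np.tile(mu_Q, (len(X), 1))])    # rows [x_i ; mu_Q]
    return np.maximum(joint @ W1, 0.0) @ w2               # shared scoring network

X = rng.normal(size=(5, d))
perm = np.array([4, 3, 2, 1, 0])
scores = fate_scores(X)
scores_permuted = fate_scores(X[perm])
scores_larger_task = fate_scores(rng.normal(size=(9, d)))
```

Because $\mu_{Q}$ is a mean over the embedded objects, reordering the input only reorders the scores, and the same weights apply to tasks of any size; each object is touched a constant number of times, which is where the linear prediction cost comes from.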

6 Empirical Evaluation

The main goal of our empirical evaluation is to find out for which kind of problems FATE-Net and FETA-Net work well. Moreover, we wish to compare these approaches with existing methods for ranking and choice. In particular, the following questions will be addressed:

•

Are the decompositions FATE and FETA suitable for learning context-dependent choice functions?

•

How important is (i) the complexity/expressiveness of the underlying model class and (ii) its ability to model context-dependent choice functions, and how do these two factors interact? For example, are deep neural networks (i. e. FATE-Net and FETA-Net) really needed, or would a simpler (e. g. linear) model also suffice? Can the additional complexity/expressiveness compensate for the inability to model context-dependent choice functions?

•

To what extent is our approach able to generalize over the task size? For example, is it possible to produce accurate predictions on tasks of a specific size, even if that size has never occurred in the training data?

For the first two questions, we evaluate the approaches on a variety of general choice and singleton choice problems. We also introduce the variant FETA-Linear, which learns the FETA decomposition using only linear functions, to ascertain whether it is able to account for some of the context-effects present in the data.

In addition, we evaluate the performance of different logit models used in economics: multinomial logit (MNL) [75], nested logit (NL) [123], generalized nested logit (GNL) [122] and mixed logit (ML) [110]. The first logit model is the MNL model (referred to as GenLinearModel for the subset choice task), which assumes that the choice between two objects does not depend on other objects in the set [67]. The NL and GNL belong to the generalized extreme value (GEV) class of models that learn correlations amongst the objects in the given set, which implicitly accounts for some of the context effects, but mainly the similarity effect [9, 113]. GEV models allocate the objects in the given task $Q$ into different sets called nests and learn correlations between the objects inside each nest [122, 110]. These nests are disjoint in the case of NL [123]. GNL is the most general model of this class; it allows the fractional allocation of each object in $Q$ to each nest and learns the correlation between them [122]. ML estimates the choice probability as a mixture of multiple logits [76, 125].

Another model which was proposed for solving the task of singleton choice is the PairwiseSVM, which makes use of induced pairwise preferences to fit a linear model [32, 68].

As a recent context-dependent baseline model, we implement the set-dependent aggregation (SDA) approach by [95]. We also implement the RankNet model as an additional context-independent baseline, which learns a non-linear utility for each object by converting them to pairwise preferences [107, 18]. Due to a lack of algorithms specifically designed for the subset choice problem, we employ the same thresholding of the utilities described in (4) we use for our approaches. The threshold is tuned on a small validation set for all approaches, using the $F_{1}$ -score as target loss (see Appendix C for details).
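The threshold tuning can be illustrated with a small sketch. It assumes that the thresholding rule selects every object whose utility exceeds a scalar threshold $t$, and it searches a user-supplied grid of candidates; the function names and the grid search itself are assumptions for illustration, not the procedure from Appendix C.

```python
import numpy as np

def f1(chosen, predicted):
    """F1-measure between two boolean membership vectors over the same task."""
    true_positives = np.sum(chosen & predicted)
    denom = chosen.sum() + predicted.sum()
    return 2.0 * true_positives / denom if denom > 0 else 0.0

def tune_threshold(utilities, choices, candidates):
    """Return the candidate threshold with the highest mean F1 on validation data.

    utilities : list of float arrays, one per validation task.
    choices   : list of boolean arrays marking the observed choice sets.
    """
    def mean_f1(t):
        return np.mean([f1(c, u > t) for u, c in zip(utilities, choices)])
    return max(candidates, key=mean_f1)

# Two toy validation tasks with hand-picked utilities.
utilities = [np.array([0.9, 0.8, 0.1]), np.array([0.7, 0.2, 0.3, 0.95])]
choices = [np.array([True, True, False]), np.array([True, False, False, True])]
best_t = tune_threshold(utilities, choices, candidates=[0.0, 0.5, 0.85])
```

On this toy data, the middle candidate separates chosen from non-chosen objects in both tasks, so it is the one returned.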

All in all, we compare to both deep neural networks and linear models, so that we have baselines of varying representative power, which helps to contextualize the performance of our approaches on each dataset. Finally, to answer the third question, we train the different models on a fixed task size and predict on queries of deviating size.

6.1 Setup

All experiments are implemented in Python, and the code and the dataset generators are publicly available at https://github.com/kiudee/cs-ranking. To properly compare all models in a fair and unbiased way, we make sure to optimize the hyperparameters of each model by employing Bayesian optimization in a nested validation loop (we use the Gaussian process based implementation in scikit-optimize [49]). The final out-of-sample estimates are then computed using another outer cross-validation loop with the best hyperparameters found in each fold. The loss functions and the datasets considered throughout our empirical evaluation are introduced in the following two subsections, respectively (see Appendix C for more details).

The experiments were run on a compute cluster with a mix of NVIDIA GTX 1080 Ti and RTX 2080 Ti GPUs (on average 15-20) and Intel Xeon E5-2670 processors. One job consisting of one outer split with complete hyperparameter optimization on the validation set took on average $8$ hours. The training of FATE-Net and FETA-Net on average (across datasets) required $11$ hours. Combined, all experiments took roughly $11\,400$ GPU hours and $6000$ CPU hours.

6.2 Loss Functions

As explained in Section 4, our goal during learning is to minimize a suitable target loss $L\colon\mathcal{C}\times\mathcal{C}\longrightarrow\mathbb{R}$ . This is usually the loss one is interested in minimizing, e. g., the $F_{1}$ -measure in our case. Since these losses are usually not differentiable, they cannot readily be used in a gradient descent algorithm. Therefore, during training we opt to minimize surrogate losses which are differentiable almost everywhere instead. In this section, we will first introduce the target losses we consider (cf. Section 6.2.1). We then derive surrogate losses based on the probabilistic choice models introduced in Section 3 and based on practical considerations (cf. Section 6.2.2).

6.2.1 Target Loss Functions

The canonical loss function, which we focus on in the singleton choice setting, is the categorical 0/1-loss

$$L_{0/1}(C,C^{\prime})\coloneqq\llbracket C\neq C^{\prime}\rrbracket,\tag{20}$$

i. e., in case the ground-truth choice $C$ is $\{\boldsymbol{x}\}$ , each false prediction $C^{\prime}\not=\{\boldsymbol{x}\}$ is penalized with a loss of $1$ . In addition, we will call the quantity $1-L_{0/1}(C,C^{\prime})$ the categorical accuracy. Moving from singleton to subset choice, where $C$ and $C^{\prime}$ can now be choice sets of arbitrary size, the same loss function (20) can still be used. To signify that it is used in subset choice, we will call it the subset $0/1$ -loss. Targeting the subset $0/1$ -loss is problematic, especially whenever a task $Q$ contains many objects, since already one incorrectly predicted object results in the whole prediction being declared incorrect. One could instead opt to consider the average of the item-wise $0/1$ -loss, which is called the Hamming loss in the setting of multi-label classification [54]. However, this loss exhibits some properties that could be questioned in the context of choice. In particular, the non-prediction of a selected item (false negative) is penalized in the same way as the prediction of a non-selected item (false positive), although positives and negatives might be highly imbalanced.

A more suitable measure, which is widely used in classification, is the $F_{1}$ -measure defined as

$$F_{1}(C,C^{\prime})\coloneqq\frac{2\,|C\cap C^{\prime}|}{|C|+|C^{\prime}|}\tag{21}$$

for any $C,C^{\prime}\in\mathcal{C}$ . This measure takes values in $[0,1]$ and large values indicate conformity between $C$ and $C^{\prime}$ . (Later on, we will nevertheless report the $F_{1}$ -measure itself, which is common practice in machine learning.) An appropriate loss can therefore be defined as

$$L_{F_{1}}(C,C^{\prime})\coloneqq 1-F_{1}(C,C^{\prime}).$$

In spite of the existence of other measures that specifically aim at correctly predicting positives, such as the informedness [88, 87], we will mostly focus on $L_{F_{1}}$ as the target loss, because it is well known and commonly used as a performance metric. That means that we will use it as the validation loss for the Bayesian hyperparameter optimization we run for every learner. Additional evaluation measures we report are described in Appendix B.
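To make the differences between the target measures discussed in this subsection concrete, the following sketch evaluates them on one toy prediction. It assumes that objects are represented by hashable identifiers (here integer indices), that choice sets are non-empty, and that the $F_{1}$-measure takes its standard set-based form; the function names are chosen for this example only.

```python
def zero_one_loss(C, C_pred):
    """Subset 0/1-loss: 1 unless the predicted set matches exactly."""
    return 0.0 if set(C) == set(C_pred) else 1.0

def hamming_loss(C, C_pred, Q):
    """Average item-wise 0/1-loss over all objects of the task Q."""
    C, C_pred = set(C), set(C_pred)
    return sum((x in C) != (x in C_pred) for x in Q) / len(Q)

def f1_measure(C, C_pred):
    """Standard set-based F1-measure (assumes non-empty choice sets)."""
    C, C_pred = set(C), set(C_pred)
    return 2 * len(C & C_pred) / (len(C) + len(C_pred))

Q = list(range(10))      # a task with ten objects
C = {0, 1}               # ground-truth choice set
C_pred = {1, 2, 3}       # predicted choice set

l01 = zero_one_loss(C, C_pred)
lham = hamming_loss(C, C_pred, Q)
f1 = f1_measure(C, C_pred)
```

Here the subset 0/1-loss is 1, the Hamming loss is only 0.3 because the seven correctly rejected objects dominate the average, and the $F_{1}$-measure is 0.4, which reflects that only one of the two chosen objects was recovered.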

6.2.2 Surrogate Losses

The probabilistic setting for choice that we introduced in Section 3 suggests a natural approach to learning and prediction:

•

First, a learner is trained using the log-likelihood of the probabilistic model as a loss function. This loss function is not only differentiable, but also calibrated in the sense of being minimized by the true (conditional) probabilities. In other words, a learner trained with this loss is supposed to predict (unbiased) probabilities on the choice space $\mathcal{C}$ (conditioned on the query).

•

Thus, given a query for which a prediction is sought, a probability distribution on the choice space $\mathcal{C}$ can be obtained as a prediction, which in turn allows for minimizing any target loss in expectation.

More specifically, let $U(\cdot,Q)$ denote the latent utility scores $U(\boldsymbol{x},Q)$ , ${\boldsymbol{x}\in Q}$ , predicted by a learner on a query $Q\in\mathcal{Q}$ . In a singleton choice scenario, where the data is supposed to be generated according to choice probabilities ${\operatorname{\mathit{p}}^{\tilde{U}}_{\text{MNL}}(\boldsymbol{x}\mid Q)}=\operatorname{\mathit{p}}_{\text{MNL}}(\boldsymbol{x}\mid Q)$ of the form (5) for some unknown ground-truth $\tilde{U}$ , one may define the corresponding categorical cross-entropy loss gained when observing ${C=\{\boldsymbol{x}\}\in\mathcal{C}}$

[TABLE]

This expression is minimized in case $\boldsymbol{x}=\operatorname*{\arg\,\max}_{\boldsymbol{y}\in Q}U(\boldsymbol{y},Q)$ .

If dealing with subset choice data that is presumably sampled according to the choice probability distribution $\operatorname{\mathit{p}}^{U}(C\mid Q)=\operatorname{\mathit{p}}(C\mid Q)$ from (6), it is natural to measure prediction $C\in 2^{Q}\setminus\{\emptyset\}$ by means of the corresponding binary cross-entropy loss

[TABLE]

In spite of the theoretical justification of the logistic losses discussed above, we found that “hinge-variants” of the respective 0/1-losses may sometimes lead to more stable results. More specifically, for the singleton choice setting, the categorical hinge loss defined via

[TABLE]

for any $\boldsymbol{x}\in Q\in\mathcal{Q}$ , is inspired by the hinge loss used in multi-class classification [29, 78] and can be used instead of (22).

Finally, for training FATE-Net and FETA-Net in the experiments below, we use the binary cross-entropy loss for the subset choice setting and the categorical hinge loss for the singleton choice setting, since these turned out to work well in preliminary experiments. In addition, an $L_{2}$ -regularization term for the magnitude of the weights is added and optimized as part of the loss during training.
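The three surrogate losses can be sketched as functions of the predicted utility vector of a single task. The exact displayed definitions are not reproduced here, so the forms below are assumptions: the softmax negative log-likelihood for the categorical cross-entropy, an object-wise logistic loss averaged over the task for the binary cross-entropy, and a margin of 1 against the best competing object for the categorical hinge.

```python
import numpy as np

def categorical_cross_entropy(u, chosen):
    """-log softmax(u)[chosen], computed with a stabilized log-sum-exp."""
    m = u.max()
    return m + np.log(np.exp(u - m).sum()) - u[chosen]

def binary_cross_entropy(u, y):
    """Logistic loss per object, averaged over the task (averaging is assumed).

    y holds 1.0 for chosen objects and 0.0 otherwise.
    """
    # log(1 + exp(-u)) for chosen objects, log(1 + exp(u)) for the others
    return np.mean(np.logaddexp(0.0, -u) * y + np.logaddexp(0.0, u) * (1.0 - y))

def categorical_hinge(u, chosen):
    """max(1 + best competing utility - utility of the chosen object, 0)."""
    others = np.delete(u, chosen)
    return max(1.0 + others.max() - u[chosen], 0.0)

u = np.array([2.0, 0.0, -1.0])
ce = categorical_cross_entropy(u, chosen=0)
hinge_correct = categorical_hinge(u, chosen=0)  # margin satisfied
hinge_wrong = categorical_hinge(u, chosen=1)    # margin violated
bce_uninformative = binary_cross_entropy(np.zeros(3), np.array([1.0, 0.0, 1.0]))
```

The hinge loss vanishes as soon as the chosen object leads by a margin of at least 1, whereas the cross-entropy stays positive for any finite utilities.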

Convexity of the Surrogate Losses

An important consideration for the surrogate losses to be used during training is whether they are convex with respect to the utility scores $U(\boldsymbol{x},Q)$ . All three losses introduced above are indeed convex. To see this for $L_{\text{CE}}$ , notice that (22) can equivalently be written as $\log\bigl{(}\sum\nolimits_{\boldsymbol{y}\in Q}\exp(U(\boldsymbol{y},Q)-U(\boldsymbol{x},Q))\bigr{)}$ . The inner difference of utilities is linear and therefore convex. The outer function is also known as LogSumExp and is defined via $\operatorname{LSE}(\boldsymbol{x})\coloneqq\log(\sum_{j\in[m]}\exp(x_{j}))$ . It is convex, and since it is also strictly increasing in each argument, the composition (22) is convex as well.

As for the binary cross-entropy $L_{\text{BE}}$ , note that the inner function $s\colon\mathbb{R}\longrightarrow\mathbb{R}$ , $s(x)\coloneqq\log(1+\exp(x))$ of (23) is smooth with strictly positive first and second derivatives and hence convex and non-decreasing. Similarly, $\tilde{s}(x)\coloneqq s(x)-x$ is convex and strictly decreasing on $\mathbb{R}$ . Hence, we can conclude that (23) is convex.

Finally, the categorical hinge (24) contains the function $h\colon\mathbb{R}^{m}\longrightarrow\mathbb{R}$ , $\boldsymbol{x}\mapsto\log\bigl{(}\sum_{j\in[m]}\exp(x_{j}-x_{i})\bigr{)}$ , which is convex as the logarithm of a maximum of convex functions. Since $s\colon\mathbb{R}\longrightarrow\mathbb{R}$ , $x\mapsto\max(1+x,0)$ is convex and non-decreasing, $s\circ h$ and therefore (24) is convex as well.

The FETA model further decomposes $U(\boldsymbol{x},Q)$ into an aggregation of sub-utility functions $U_{0}$ and $U_{1}$ . It is therefore interesting to ask whether the surrogate losses are also convex with respect to the sub-utility values $U_{0}(\boldsymbol{x})$ , $U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})$ . We can answer this question in the affirmative, since the FETA utility values are positively weighted sums of these sub-utility scores.

However, the overall learning problem depends on the parameter $\theta$ of the realization of $U_{\text{FETA}}$ and $U_{\text{FATE}}$ and the corresponding loss function can possibly still be non-convex w.r.t. $\theta$ (as this is the case with the neural networks employed here). That means in practice we lose the guarantee of stochastic gradient descent to find a global optimum, but with careful tuning of the optimization process one can still expect to find reasonable solutions.

6.3 Datasets

We now introduce the learning problems used for the empirical comparison as follows:

(a) The Medoid problem, where the task is to predict the medoid of a set of points in a Euclidean space.

(b) The Pareto-front problem, in which the learner has to predict the set of points which are Pareto-optimal.

(c) The Hypervolume singleton choice problem, where the task is to select the point of the Pareto-front which contributes the most to the hypervolume.

(d) Different choice problems defined on the well-known MNIST dataset.

(e) Similarity/dissimilarity-based movie selection using the MovieLens Tag Genome dataset [118].

(f) The LEarning TO Rank (LETOR) MQ $2007$ and MQ $2008$ datasets [89] consisting of query-document pairs, with the goal to select the relevant documents.

(g) The Expedia hotel dataset featuring search results and relevance labels for each hotel with the goal to select booked/considered hotels [33].

(h) The Sushi dataset, where the task is to choose the most preferred sushi from a set of $10$ options provided to a user.

See Table 1 for an overview of the datasets and their properties. In the following sections, we will describe the different datasets, their motivation, and if applicable, how they are generated.

6.3.1 The Medoid Problem

The motivation for this problem is the general idea of learning to choose a most representative element from a set. More concretely, the medoid of a set is the object with the smallest cumulative dissimilarity to all other objects of the set (as opposed to the centroid, which is usually not part of the original set). It is commonly used as a representative element, especially for structured objects such as graphs, $2$ -D trajectories, images, etc. [116, 127].

Formally, we are interested in learning the choice function $c_{\operatorname{medoid}}\colon\mathcal{Q}\longrightarrow\mathcal{C}$ given as

[TABLE]

where we write here and throughout the remainder of this paper $\lVert\cdot\rVert$ for the standard Euclidean norm defined as $\lVert\boldsymbol{z}\rVert=\sqrt{\boldsymbol{z}^{t}\boldsymbol{z}}$ . The singleton choice produced by this procedure incorporates all pairwise distances among the objects, which makes it a good context-dependent learning problem to investigate. In particular, $c_{\operatorname{medoid}}$ is sensitive to changes of the elements in the task. With $U_{0}(\boldsymbol{x})\coloneqq 0$ and $U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})\coloneqq-\lVert\boldsymbol{x}-\boldsymbol{y}\rVert$ we clearly have

[TABLE]

and thus $U_{\text{FETA}}^{U_{0},U_{1}}$ is able to exactly model $c_{\operatorname{medoid}}$ .

In contrast to this, for the FATE approach, it is not immediately obvious if and how it is capable of modelling $c_{\operatorname{medoid}}$ exactly. However, the choices $\mathcal{Z}\coloneqq\mathcal{X}$ , $\phi\coloneqq\mathrm{id}_{\mathcal{X}}$ and $U^{\prime}(\boldsymbol{x},\boldsymbol{z})\coloneqq-\lVert\boldsymbol{x}-\boldsymbol{z}\rVert$ yield

[TABLE]

with $\operatorname{centroid}(Q)\coloneqq\frac{1}{|Q|}\sum_{\boldsymbol{y}\in Q}\boldsymbol{y}$ being the centroid of $Q$ . Thus, the item $\boldsymbol{x}\in Q$ , which is closest to $\operatorname{centroid}(Q)$ , i. e., $\operatorname*{\arg\,\max}_{\boldsymbol{x}\in Q}U_{\mathrm{FATE}}^{U^{\prime},\phi}(\boldsymbol{x},Q)$ , is likely to coincide with the medoid of $Q$ . As we construct our synthetic medoid dataset by sampling $Q$ according to the uniform distribution $\nu$ on $\{A\subseteq[0,1]^{d}:|A|=r\}$ for some predefined $r\in\mathbb{N}$ , there is with $U_{\mathrm{FATE}}^{U^{\prime},\phi}$ a FATE-instance, which is expected to have (for the case of singleton choice) an accuracy of at least

[TABLE]

on the synthetic medoid dataset. An empirical evaluation revealed that this value is $89.56\,\%$ for $r=10$ and $d=5$ . For the details on this dataset, confer Section E.1.
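The relationship between the medoid and the object closest to the centroid can be explored with a short simulation. The sketch below samples tasks uniformly from $[0,1]^{d}$ and counts how often the two selections coincide; the number of sampled tasks and the seed are arbitrary assumptions, and the snippet is not the evaluation procedure behind the figure quoted above.

```python
import numpy as np

def medoid_index(Q):
    """Index of the point with the smallest summed Euclidean distance to all others."""
    dists = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)
    return int(np.argmin(dists.sum(axis=1)))

def closest_to_centroid_index(Q):
    """Index of the point nearest to the centroid (the FATE-style surrogate)."""
    return int(np.argmin(np.linalg.norm(Q - Q.mean(axis=0), axis=1)))

def agreement_rate(r=10, d=5, n_tasks=2000, seed=0):
    """Fraction of random tasks on which both selections coincide."""
    rng = np.random.default_rng(seed)
    hits = sum(
        medoid_index(Q) == closest_to_centroid_index(Q)
        for Q in rng.uniform(size=(n_tasks, r, d))
    )
    return hits / n_tasks

toy = np.array([[0.0], [1.0], [3.0]])
toy_medoid = medoid_index(toy)
toy_closest = closest_to_centroid_index(toy)
rate = agreement_rate(n_tasks=200)
```

The resulting rate is a Monte Carlo estimate and fluctuates with the seed and the number of tasks, so it can only be compared loosely with the reported value.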

6.3.2 The Pareto-Front Problem

The computation of a Pareto-optimal set of points is an important problem in optimization and various fields of application [41]. We say $\boldsymbol{x}\in\mathcal{X}\subseteq\mathbb{R}^{d}$ is dominated by $\boldsymbol{y}\in\mathbb{R}^{d}$ (short: $\boldsymbol{y}\succ\boldsymbol{x}$ ) if $x_{i}\leq y_{i}$ holds for any $1\leq i\leq d$ and $x_{j}<y_{j}$ for at least one $1\leq j\leq d$ . For any set $Q\in\mathcal{Q}$ we define the Pareto-set or Pareto-front of $Q$ as

[TABLE]

We wish to investigate the possibility to learn the mapping from sets of points to their respective Pareto-sets. It is clear that the size of the Pareto-sets is not constant, which makes it a good candidate for a general subset choice problem. With the choices $U_{0}(\boldsymbol{x})\coloneqq 0$ and $U_{1}(\boldsymbol{x},\{\boldsymbol{y}\})\coloneqq-\llbracket\boldsymbol{y}\succ\boldsymbol{x}\rrbracket$ we have

[TABLE]

Hence, $c_{\mathrm{Pareto}}(Q)=\operatorname*{\arg\,\max}_{\boldsymbol{x}\in Q}U_{\mathrm{FETA}}^{(U_{1})}(\boldsymbol{x},Q)$ holds trivially for each $Q\in\mathcal{Q}$ , i. e., the Pareto problem is exactly solvable via the FETA approach. To create the corresponding synthetic datasets, we generate choice tasks $Q$ consisting of $30$ points sampled uniformly at random in $\mathbb{R}^{2}$ and $\mathbb{R}^{5}$ , respectively. The ground truth for each task is the Pareto-set of $Q$ , which contains only the non-dominated objects (see Section E.2 for details).
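The dominance relation and the resulting FETA utilities can be written down directly. The sketch below follows the definition of dominance given above (larger coordinates are better) and the row-averaging of pairwise utilities used by FETA; the function names are chosen for this example only.

```python
import numpy as np

def dominance_counts(Q):
    """For every row x_i of Q, count the rows y_j that dominate it."""
    # geq[i, j]: y_j is at least as large as x_i in every coordinate
    geq = (Q[None, :, :] >= Q[:, None, :]).all(axis=-1)
    # gt[i, j]: y_j is strictly larger than x_i in at least one coordinate
    gt = (Q[None, :, :] > Q[:, None, :]).any(axis=-1)
    return (geq & gt).sum(axis=1)

def pareto_front_mask(Q):
    """Boolean mask of the non-dominated rows of Q."""
    return dominance_counts(Q) == 0

def feta_pareto_utility(Q):
    """Average of U_1(x, {y}) = -[[y dominates x]] over all other objects."""
    return -dominance_counts(Q) / (len(Q) - 1)

Q = np.array([[1, 3], [2, 2], [3, 1], [1, 1], [2, 1]], dtype=float)
mask = pareto_front_mask(Q)
utility = feta_pareto_utility(Q)
```

Exactly the non-dominated points attain the maximal utility of zero, so selecting the maximizers of the utility recovers the Pareto-set.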

6.3.3 Hypervolume

A related but much harder problem is the computation of hypervolume contributions of objects on a Pareto front. The hypervolume $\lambda_{\mathrm{HypVol}}(Q)$ of a subset $Q\subseteq\mathbb{R}^{d}$ describes the volume of the union of the subspaces dominated by each individual point $\boldsymbol{x}=(x_{1},\dots,x_{d})$ in the Pareto set of $Q$ and can formally be defined as

[TABLE]

where $\lambda$ denotes the Lebesgue measure of $\mathbb{R}^{d}$ . In the context of multi-objective evolutionary algorithms, one usually computes the contributions $\lambda_{\mathrm{HypVol}}(Q)-\lambda_{\mathrm{HypVol}}(Q\setminus\{\boldsymbol{x}\})$ of each point $\boldsymbol{x}\in Q$ to the overall hypervolume $\lambda_{\mathrm{HypVol}}(Q)$ , i. e., the reduction in hypervolume caused by removing one object from the set. We consider the problem of learning the corresponding Hypervolume choice function $c_{\mathrm{HypVol}}\colon\mathcal{Q}\longrightarrow\mathcal{C}$ , which picks that element $\boldsymbol{x}\in Q$ with the largest contribution to the overall hypervolume, i. e.,

[TABLE]

As shown by [17, Theorem 1 ], it is #P-hard to calculate $c_{\mathrm{HypVol}}(Q)$ . Here, we generate sets of $10$ random points in $\mathbb{R}^{2}$ and determine the singleton choice.

6.3.4 MNIST Number Problems

The original goal of the Modified National Institute of Standards and Technology (MNIST) dataset was to facilitate the comparison between different handwritten digit classifiers [65]. It consists of $70\,000$ grayscale images of size $28\times 28$ . We use the dataset to create challenging choice problems, both singleton and general subset choice. To level the playing field between all the approaches, we first train a convolutional neural network (CNN) on $10\,000$ instances and use it to extract high level features for the remaining $60\,000$ images (see Section E.3 for more details). To convert this dataset to a choice problem, we randomly sample sets of $10$ numbers and choose based on the following procedures:

Mode: For the Mode dataset, we choose the numbers that occur most often in the choice task $Q$ . For example, given a set of numbers $\{1,1,2,4,4,5,5,6,6,6\}$ , we choose all instances with value equal to the mode value $6$ . For the singleton choice task, we only output one of the numbers (the representation of which has the least angle to a predefined vector).

Unique: Here, we choose all numbers that occur only once in the set of sampled label values. For example, given a set of numbers $\{1,1,2,3,4,4,5,5,6,6\}$ , we choose the numbers $\{2,3\}$ . For the singleton choice problem, we ensure that exactly one of the digits is unique.
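Both labeling rules operate on the digit labels only and can be sketched as follows. The functions return the positions of the chosen instances within the task; returning all tied modes when several digits are equally frequent is an assumption, since the text does not specify tie handling.

```python
from collections import Counter

def mode_choice(labels):
    """Positions of all instances whose digit occurs most often in the task."""
    counts = Counter(labels)
    top = max(counts.values())
    return [i for i, y in enumerate(labels) if counts[y] == top]

def unique_choice(labels):
    """Positions of all instances whose digit occurs exactly once in the task."""
    counts = Counter(labels)
    return [i for i, y in enumerate(labels) if counts[y] == 1]

# The two examples from the text.
mode_positions = mode_choice([1, 1, 2, 4, 4, 5, 5, 6, 6, 6])
unique_positions = unique_choice([1, 1, 2, 3, 4, 4, 5, 5, 6, 6])
```

For the first example this selects the three instances of the digit 6, and for the second the instances of the digits 2 and 3. Whether an instance is chosen depends on the labels of all other instances in the task, which is what makes both problems context-dependent.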

6.3.5 MovieLens Tag Genome

The MovieLens Tag Genome dataset consists of a large collection of movies and community curated tags [118]. For each movie, the relevance of every tag is provided on a continuous scale in $[0,1]$ . Thus, the complete relevance vector of a movie can be regarded as that movie's “genome.”

We consider the problem of choosing the most similar/dissimilar movie from a set of movies, where one movie is regarded as the reference to which the others are compared. We define this reference movie to be the medoid of the movies in a given set. To compute similarities in tag relevance space, we use the weighted cosine similarity as proposed by [117].

6.3.6 LETOR

LETOR is a collection of benchmark datasets for different learning-to-rank problems [89]. The Gov2 web page collection, consisting of roughly 25 M pages, is the corpus and the query sets of the Million Query track of the TREC $2007$ and $2008$ [111, 112] are used to create $8$ datasets. Each query-document pair is defined by a vector consisting of $46$ features. We use the supervised ranking datasets MQ $2007$ and MQ $2008$ to create the choice dataset. We treat all documents with a relevance score of 1 and 2 as the chosen objects. Since all queries include multiple documents with relevance scores $1$ and $2$ , we cannot extract singleton choices from this dataset. The listwise ranking datasets MQ $2007$ -list and MQ $2008$ -list contain real-valued scores of the documents in the underlying permutations, and hence facilitate the singleton choice for each query (details of the exact procedure can be found in Section F.1).

6.3.7 Expedia

The Expedia dataset was released on the Kaggle website as a competition in 2016 [33]. It consists of $399\,344$ lists of hotels, each resulting from a search query of a user. For each hotel, there are $45$ features and a relevance score, indicating how relevant the hotel is to the provided query. A score of $0$ means that it was not relevant, a score of $1$ indicates that the user clicked on it, and a $2$ implies that the hotel was booked. It is straightforward to construct choice datasets: for singleton choice the goal is simply to predict the booked hotel, whereas for subset choice we require the learners to output the complete set of hotels that were at least clicked on (see Section F.2 for more details).

6.3.8 SUSHI

SUSHI (available for download from http://www.kamishima.net/sushi/) is a dataset created by [57] specifically for the task of object ranking. The authors considered $100$ sushis and asked users to rank them according to their preference. The dataset consists of two sets of $5000$ rankings. Each ranking consists of $10$ sushis, which were ranked by users in a survey. For the first set, the authors asked the users to rank the top-10 most popular sushis. In the second set, users were shown random sets of $10$ sushis instead. Each sushi is described by $7$ object features. Additional user features are available, but not used in our experiments. For our experiments, we merge both datasets into a single one containing $10\,000$ instances. We use it as a singleton choice dataset by choosing the most preferred sushi as the singleton choice for the given task set $Q$ (details of the exact procedure can be found in Section F.3).

6.4 Results and Discussion

In this section, we provide the results obtained by evaluating different subset choice and singleton choice models on the datasets. To be concise, we only show plots for the target losses here and list the complete set of results in Tables 9, 10 and 11 in Appendix G. It is illuminating to compare the performance of FATE-Net and FETA-Net to the context-independent neural network RankNet. This provides a rough indicator for how important being able to model context-dependence is.

6.4.1 Singleton Choice

We will start by discussing the results for the singleton choice models (cf. Figure 5), where the bars depict the mean value of the categorical accuracy (26) across the cross-validation folds, with black lines depicting the standard deviation.

The first observation is that FATE-Net and FETA-Net significantly outperform all other baselines on the tasks for which it was clear that the underlying choice function is context-dependent (i. e., Hypervolume, Medoid and the MNIST datasets). The SDA network, which is also a context-dependent model, achieves competitive results on the Medoid and the MNIST datasets. The linear FETA variant FETA-Linear and the non-linear neural network RankNet perform comparably to the other baseline approaches. This suggests that a combination of non-linearity and the ability to model context-dependence is really necessary to improve on these tasks. One notable exception is the Medoid dataset, for which RankNet and FETA-Linear manage to outperform the other baselines by a large margin.

For the MNIST-Unique problem, FATE-Net and FETA-Net achieve an accuracy of more than $90\,\%$ and SDA is competitive with over $80\,\%$ . Additionally, the GNL and ML models are also able to perform better than the other baselines. It is easy to see that the dataset exhibits the similarity context effect proposed by [52], i. e., adding multiple instances of the same digit to the choice task reduces the choice probability of all equal digits to 0. As is apparent, the GNL and ML models are able to account for it and score better than chance.

Since FATE-Net, FETA-Net and SDA were able to achieve close to $100\,\%$ accuracy on the MNIST-Unique problem, we performed an additional experiment where we generated instances completely synthetically. We represent each number $i\in\{0,\dots,9\}$ by the corresponding standard unit vector $\boldsymbol{e}_{i}$ , which is $1$ in the $i$ -th position and $0$ everywhere else. Apart from that, the task remains the same. We calibrate each network to have roughly the same number of parameters ( $2870$ for FATE-Net, $2849$ for FETA-Net and $2850$ for SDA), and the remaining hyperparameters are equal for all networks. We then train them until convergence on a stream of newly generated batches of 1024 instances with 10 objects each. The resulting convergence behavior is shown in Figure 6. Both FETA-Net and SDA are able to converge to $100\,\%$ out-of-sample categorical accuracy within 100 epochs, while FATE-Net only achieves slightly over $60\,\%$ , and more epochs alone were not enough for it to learn the target function without error. We therefore repeated the experiment for FATE-Net with a higher epoch and parameter budget. With $5985$ parameters, FATE-Net is now able to perfectly learn the fully synthetic unique problem within 400 epochs. On the one hand, this shows that from a representational perspective, all three models are able to learn this particular target choice function perfectly. On the other hand, FATE-Net appears to be less parameter- and data-efficient, which could indicate that evaluating the utilities in the context of the set embedding is not well suited to represent these kinds of problems. The behavior of all three networks was consistent across repetitions of the experiment.

On the real-world datasets (i. e., Sushi, Movielens Tag Genome, LETOR and Expedia), the performance of FATE-Net and FETA-Net is closer to that of the remaining baselines. Although they still obtain slightly higher accuracy on average, the margin is not as pronounced. Surprisingly, SDA achieved the worst accuracy on LETOR and Expedia. We suspect that this results from the models being trained only on a fixed choice task size in our experiments, while they are evaluated on choice tasks of varying size at test time. Since SDA learns a set-dependent aggregation function, it could be that this does not generalize well to the larger choice tasks present in the real-world datasets.

6.4.2 Subset Choice

We evaluate the subset choice models in terms of their $F_{1}$ -measure (21) and report the results in Figure 7. To see if the models are able to learn anything, we also show the performance of the baseline that always predicts positive.

The general pattern is confirmed: FATE-Net, FETA-Net and SDA surpass the other baselines on the datasets Pareto-front 2D, MNIST Mode, and MNIST Unique, while being competitive for the real-world datasets LETOR and Expedia. For the MNIST tasks Unique and Mode, the first observation is that all linear and/or context-independent baseline approaches fail to learn anything on these datasets, since they all achieve the same $F_{1}$ -measure as the all-positive baseline. Thus, it is clear that these tasks can only be solved by models that are both context-dependent and non-linear.

For the Pareto problem, it can be observed that the context-dependent models FETA-Net, FATE-Net, and SDA outperform all benchmark choice models on the 2D version. On the 5D version of the dataset, however, the performance of all approaches reaches a comparable level. This indicates that solving the task of selecting the Pareto-front becomes less context-dependent in higher dimensions, since the distance of a point from the center becomes more and more informative. At the same time, more points are on the Pareto-front overall, which is apparent from the high $F_{1}$-measure of the AllPositive baseline.

As before, the results are more homogeneous on the real-world datasets Expedia and LETOR MQ2007/MQ2008. FATE-Net and FETA-Net are still outperforming all the benchmarks. This suggests that the ability to model context-dependence in the data is slightly more important for these datasets than learning a non-linear utility function. SDA achieves the best result on the Expedia dataset, which, when compared to the bad performance on the singleton choice variant of the dataset, suggests that the thresholding of the utilities is robust to the model output changing with varying choice task sizes.

Overall, the results demonstrate that FATE-Net and FETA-Net are able to improve on the context-independent baselines by a large margin on tasks which are strongly context-dependent and show competitive results when compared to SDA. The improvement is due to both the task-sensitivity of these models and the ability to model non-linear utility functions. For the real-world datasets, the improvements are smaller, suggesting that context-effects are either less pronounced or that the context-effects in real-world data cannot fully be captured yet.

6.4.3 Generalization Across Task Sizes

We conduct additional experiments to gauge the generalization capability of the learned models to unseen task sizes (refer to Appendix D for more details). We show the results for the datasets Medoid and Hypervolume, because, as will be seen, they exhibit some interesting properties. We specifically compare the performance on the singleton choice datasets (Figure 8).

We train the models on a fixed task size and then test them on sets containing between $3$ and $21$ objects. Note that for singleton choice, the accuracy is not comparable across differing task sizes. We instead report the normalized accuracy (see Appendix B.1), which fixes this issue and guarantees that random guessing achieves exactly 0.

Overall, the models manage to generalize quite well to task sizes for which they were not trained. The exact generalization behavior depends on the dataset, though. Considering the Medoid dataset, we can observe that the models FETA-Net, FETA-Linear and RankNet even improve in performance with larger task sizes. This is plausible, since the more points fill the space, the more the problem can be solved by a context-independent model, which assigns the highest score to objects in the center. For the singleton choice version of Hypervolume, on the other hand, the performance of all models drops with an increasing number of objects, suggesting it becomes much harder to identify the object that contributes the most to the overall hypervolume. This is especially visible for the baselines, which, even though they were trained on 10 objects, achieve their best performance on 3 objects. FETA-Net, FATE-Net, and FETA-Linear stand out here, since their performance decays much more slowly. All in all, we conclude that our networks FETA-Net and FATE-Net are able to generalize very well to unseen task sizes, with FETA-Net additionally benefiting if the task becomes less context-dependent with larger task sizes.

7 Conclusion and Future Work

In this paper, we tackle the problem of choice from a machine learning perspective. More specifically, we propose a framework for learning context-dependent choice functions, which, on the basis of choice behavior observed in the past, allow for predicting the choice of objects in new situations. This is essentially accomplished by learning generalized (latent) scoring (utility) functions, which are supposed to control the choice behavior.

Violations of context-independence are common in human choice behavior. Therefore, accounting for the various context effects they can exhibit can be seen as an important problem. Still, we consider the space of interesting non-trivial choice functions to be vastly larger, and the goal is to have general purpose models that can adapt to a wide variety of (yet unknown) context effects.

To this end, we propose two principled decompositions: The FETA decomposition is a first-order approximation to a more general utility decomposition. It considers each object in local sub-contexts, the contributions of which are averaged. The FATE approach, on the other hand, first maps each object into an embedding space and computes a representative of the choice task by averaging these embedded points. The utility of each object is then evaluated with the representative as global context. Both approaches are complementary and have differing inductive biases. In spite of this, both show promising predictive performance.

While the FETA and FATE decompositions are general and in a sense quite natural approaches to model context-dependent choice functions, a promising direction is the investigation of application-specific models with more focused inductive biases. An example is the SDA approach, which applies principles from behavioral choice theory and also tries to take the risk-aversion of humans into account [95].

While the most influential context effects for human choices have been studied, gaining a deeper understanding of the rich mathematical structure of general choice problems is an important future endeavor.

Acknowledgements

The authors gratefully acknowledge the financial support provided by the European Regional Development Fund (ERDF) and the valuable feedback provided by the industry partners of the Smart-GM research project – EFRE-0801915.

Funded by the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) – 317046553.

This work is part of the Collaborative Research Center “On-the-Fly Computing” at Paderborn University, which is supported by the German Research Foundation (DFG). Experiments were performed on resources provided by the Paderborn Center for Parallel Computing.

Appendix A Notation

Appendix B Evaluation Measures

Besides the target losses introduced in Section 6.2, we evaluate the trained models using additional evaluation measures. These should give a more complete picture of the performance of the different models. The results including the additional measures can be found in Appendix G.

B.1 Singleton Choice

To define the evaluation measures in the singleton choice setting, suppose in the following a choice task space $\mathcal{Q}\subset 2^{\mathcal{X}}$ , a utility function $U$ for $\mathcal{Q}$ as well as $Q\in\mathcal{Q}$ and $\boldsymbol{x}\in Q$ to be arbitrary but fixed.

Top- $k$ Categorical Accuracy

The top- $k$ categorical accuracy is defined as the fraction of times in which the set of objects in the top $k$ positions, according to the predicted scores, contains the ground-truth chosen object [23, 9]. Formally, writing $Q=\{\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{|Q|}\}$ with $U(\boldsymbol{y}_{1},Q)\geq\dots\geq U(\boldsymbol{y}_{|Q|},Q)$ , we have

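$$\operatorname{m}_{\text{top-}k}(U,Q,\{\boldsymbol{x}\})=\begin{cases}1&\text{if }\boldsymbol{x}\in\{\boldsymbol{y}_{1},\dots,\boldsymbol{y}_{k}\},\\0&\text{otherwise.}\end{cases}$$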

Categorical Accuracy

The categorical accuracy is defined as the fraction of times in which the object with the largest score is the same as the ground-truth singleton choice, i. e.,

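$$\operatorname{m}_{\text{CA}}(U,Q,\{\boldsymbol{x}\})=\begin{cases}1&\text{if }\boldsymbol{x}\in\operatorname*{\arg\,\max}_{\boldsymbol{y}\in Q}U(\boldsymbol{y},Q),\\0&\text{otherwise.}\end{cases}$$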

The categorical accuracy is the most common measure used for the evaluation of SCMs and is commonly referred to as hit-rate [9]. It is evident that $\operatorname{m}_{\text{CA}}(U,Q,\{\boldsymbol{x}\})=\operatorname{m}_{\text{top-}1}(U,Q,\{\boldsymbol{x}\})$ holds, provided $\operatorname*{\arg\,\max}_{\boldsymbol{y}\in Q}U(\boldsymbol{y},Q)$ is a singleton set.

Normalized Accuracy

The measures defined above are not a reasonable estimate when observing the performance of an SCM on the choice tasks of different sizes $\lvert Q\rvert$ , since the task becomes harder as the choice task size increases. The hardness of the task should be adjusted with respect to the accuracy that random guessing can achieve, which is defined as the probability of choosing the correct singleton choice from the choice task $Q$ . Assuming each object to be chosen with the same probability, the probability for choosing a fixed object is $\frac{1}{|Q|}$ . These considerations motivate the definition of the normalized accuracy as follows:

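$$\operatorname{m}_{\text{NA}}(U,Q,\{\boldsymbol{x}\})=\frac{\operatorname{m}_{\text{CA}}(U,Q,\{\boldsymbol{x}\})-\frac{1}{\lvert Q\rvert}}{1-\frac{1}{\lvert Q\rvert}}$$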

Note that this measure takes values in $[-\frac{1}{\lvert Q\rvert-1},1]$. The minimum value of $-\frac{1}{\lvert Q\rvert-1}$ is achieved when the algorithm performs with an accuracy of $0$, i. e., it is worse than random guessing, and the maximum value of $1$ when the learner always predicts correctly. A value of $0$ indicates that the learner performs similarly to random guessing. This measure was derived using the “correction for guessing” formulation [28].
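The singleton choice measures can be sketched in a few lines. The code below is an illustrative reading of the definitions in this section, not the evaluation code used for the experiments; the function names, the representation of a choice task as an array of predicted scores, and the tie-breaking by position are assumptions.

```python
import numpy as np

def top_k_accuracy(scores, chosen, k=1):
    """Return 1.0 if the chosen index is among the k highest scores, else 0.0."""
    scores = np.asarray(scores, dtype=float)
    # Stable sort on negated scores: ties are broken by original position.
    top_k = np.argsort(-scores, kind="stable")[:k]
    return float(chosen in top_k)

def normalized_accuracy(accuracy, task_size):
    """Correction for guessing: chance level 1/|Q| is mapped to 0."""
    chance = 1.0 / task_size
    return (accuracy - chance) / (1.0 - chance)

scores = [0.1, 0.7, 0.2, 0.5]          # predicted utilities for |Q| = 4
hit_top1 = top_k_accuracy(scores, chosen=3, k=1)
hit_top2 = top_k_accuracy(scores, chosen=3, k=2)
```

With $\lvert Q\rvert=5$, an accuracy of $0$ is mapped to $-\frac{1}{4}$, the chance level $\frac{1}{5}$ to $0$, and a perfect accuracy to $1$, which matches the range stated above.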

B.2 Subset Choice

For the subset choice setting, we introduce accuracy measures in terms of a choice task $Q$ and two corresponding choices $C,\widehat{C}\subseteq Q$ for $Q$ . Here, $C$ may be thought of as the ground-truth choice for $Q$ and $\widehat{C}$ as a prediction made by a learner. In contrast to the singleton choice setting, these measures do not depend on a utility function. For the sake of convenience, we suppose $Q$ , $C$ and $\widehat{C}$ to be arbitrary but fixed in the following. To prepare some of the measures, let us formally define the quantities true positives ( $\widehat{TP}$ ), true negatives ( $\widehat{TN}$ ), false positives ( $\widehat{FP}$ ) and false negatives ( $\widehat{FN}$ ) via

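$$\widehat{TP}=\lvert C\cap\widehat{C}\rvert,\qquad\widehat{TN}=\lvert Q\setminus(C\cup\widehat{C})\rvert,\qquad\widehat{FP}=\lvert\widehat{C}\setminus C\rvert,\qquad\widehat{FN}=\lvert C\setminus\widehat{C}\rvert,$$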

respectively. These quantities are similar to those used to define the confusion matrix in the case of binary classification [64].

Subset $0/1$ Accuracy

The Subset $0/1$ Accuracy measures the number of times the ground-truth choice set $C$ and the predicted choice set $\widehat{C}$ are exactly the same. This measure is used to assess how often the algorithm's predictions match the complete choice set. Formally, it is defined as

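$$\operatorname{m}_{0/1}(C,\widehat{C})=\begin{cases}1&\text{if }C=\widehat{C},\\0&\text{otherwise.}\end{cases}$$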

Recall

Recall is defined as the proportion of real positive cases that are correctly predicted positive [87]. In the field of information retrieval, it is the fraction of the relevant documents that are successfully retrieved. For our choice setting, this can be defined as the fraction of objects from the ground-truth choice set $C$ that are chosen successfully, i. e., that are present in the predicted choice set $\widehat{C}$. Formally, it is defined as

$$\operatorname{m}_{\text{Rec}}(C,\widehat{C})=\frac{\lvert C\cap\widehat{C}\rvert}{\lvert C\rvert}$$

Precision

Precision denotes the proportion of predicted positive labels that are correct [87]. For the choice setting, this can be defined as the fraction of objects from the predicted choice set $\widehat{C}$ that are actually chosen by the decision maker or that are present in the ground-truth choice set $C$ . Formally, it is defined as:

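$$\operatorname{m}_{\text{Prec}}(C,\widehat{C})=\frac{\lvert C\cap\widehat{C}\rvert}{\lvert\widehat{C}\rvert}$$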

$F_{1}$ -Measure

The $F_{1}$ -measure is defined as the harmonic mean of precision and recall:

$$F_{1}=\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$$

It can also be expressed in form of the confusion matrix quantities as follows [64]:

$$F_{1}=\frac{2\,\widehat{TP}}{2\,\widehat{TP}+\widehat{FP}+\widehat{FN}}$$

Informedness

The informedness is a measure proposed by [88, 87], which is, in contrast to the $F_{1}$ -measure, unbiased with respect to the population prevalence of positives. It specifies the probability that the learner makes an informed prediction if compared to chance and is formally defined as

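$$\operatorname{m}_{\text{Inf}}(C,\widehat{C})=\frac{\widehat{TP}}{\widehat{TP}+\widehat{FN}}+\frac{\widehat{TN}}{\widehat{TN}+\widehat{FP}}-1$$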

A very desirable property of this measure is that it is exactly $0$ in case the learner is guessing or is constant.
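The set-based measures of this section can be sketched as follows. This is an illustration under the assumption that a choice task and its choices are given as Python sets of hashable object identifiers; the function names and the convention of returning $0$ for empty denominators are not taken from the original implementation.

```python
def confusion_counts(Q, C, C_hat):
    """Return (TP, TN, FP, FN) for ground truth C and prediction C_hat in Q."""
    Q, C, C_hat = set(Q), set(C), set(C_hat)
    tp = len(C & C_hat)          # chosen and predicted
    fp = len(C_hat - C)          # predicted but not chosen
    fn = len(C - C_hat)          # chosen but not predicted
    tn = len(Q - (C | C_hat))    # neither chosen nor predicted
    return tp, tn, fp, fn

def f1_measure(tp, fp, fn):
    """F1 in terms of the confusion quantities (0.0 if undefined: assumption)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def informedness(tp, tn, fp, fn):
    """True positive rate plus true negative rate minus one."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr + tnr - 1.0

Q = range(6)
tp, tn, fp, fn = confusion_counts(Q, C={0, 1, 2}, C_hat={1, 2, 3})
f1 = f1_measure(tp, fp, fn)
inf = informedness(tp, tn, fp, fn)
# A constant all-positive prediction has informedness 0.
inf_all_positive = informedness(*confusion_counts(Q, {0, 1, 2}, set(Q)))
```

The last line illustrates why informedness complements the $F_{1}$-measure: the all-positive prediction still reaches a non-trivial $F_{1}$ value on this example, while its informedness is $0$.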

AUC-ROC

The AUC-ROC is a performance measure, which estimates the capacity of a classification model to distinguish between two classes [35, 74]. It computes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one [74]. It is estimated by computing the area under the ROC-curve, which is created by plotting the true positive rate $\operatorname{m}_{\text{TPR}}$ against the false positive rate $\operatorname{m}_{\text{FPR}}$ , where

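$$\operatorname{m}_{\text{TPR}}=\frac{\widehat{TP}}{\widehat{TP}+\widehat{FN}},\qquad\operatorname{m}_{\text{FPR}}=\frac{\widehat{FP}}{\widehat{FP}+\widehat{TN}}.$$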

A very desirable property of this measure is that it is exactly $0.5$ in case the learner is guessing.
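The probabilistic interpretation of the AUC-ROC can be computed directly by comparing all positive-negative pairs. The following sketch does exactly that; it is quadratic in the number of objects and is meant only to illustrate the definition. Counting ties as one half is the usual convention and an assumption here, as is the function name.

```python
def auc_roc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    # Each pair contributes 1 for a correct ordering and 0.5 for a tie.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
auc_constant = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```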

Appendix C Additional Experimental Details

In this section, we list all experimental details which were excluded from the main paper for reasons of conciseness. First, we explain the process of nested cross-validation with hyperparameter optimization in detail. Then we explain which hyperparameters were tuned for the different models and which parameters were kept fixed. Lastly, we explain the design of the generalization experiment.

Empirical Comparison

In order to compare all learners fairly, we perform nested cross-validation with synchronized random streams for all the learning models, as shown in Figure 9. The hyperparameters of all models are tuned using extensive Bayesian optimization. We describe the complete procedure in two parts: first the hyperparameter optimization and second the out-of-sample evaluation. First, we configure the given learner $M$ with the default parameters $p_{d}$ described in the next section. Then we generate $5$ pairs of training datasets $\mathcal{D}_{k}$ and test datasets $\mathcal{D}_{Tk}$, one for each $k\in[5]$. The process used to generate the train-test pair for $k$ is described in Table 1.

Hyperparameter Optimization

The training set $\mathcal{D}_{k}$ is used to first identify the best hyperparameters using $3$-fold stratified cross-validation, and then to train the final learner for out-of-sample evaluation. The hyperparameter optimizer picks hyperparameters $p_{i}$ from the ranges in Table 3 for the $i$-th iteration. In the inner loop $1\leq j\leq 3$, we split the full training dataset $\mathcal{D}_{k}$ into a train set $D_{kj}$ ($90\,\%$ of $\mathcal{D}_{k}$) and a validation dataset $V_{kj}$ ($10\,\%$ of $\mathcal{D}_{k}$) using the stratified shuffle split. For the given hyperparameters $p_{i}$, we train the model on the train set $D_{kj}$ and evaluate on the validation dataset $V_{kj}$ using the target loss function. We use $1-F_{1}$-measure for general subset choice and $1-{}$categorical accuracy for singleton choice as the target loss to evaluate the hyperparameter configuration. We calculate the mean loss $\ell_{i}=\text{mean}(l_{1},l_{2},l_{3})$ for the given hyperparameters $p_{i}$. The optimization loop is run for $100$ iterations to validate $100$ sets of hyperparameters, in order to acquire the optimal parameters $p_{b}$ for the given learning model.

Out-of-Sample Evaluation

Finally, after optimization, we configure the learners $M$ using the best found hyperparameters $p_{b}$ and the remaining default parameters $p_{d}$ . Then, we train the model $M$ on the complete training dataset $\mathcal{D}_{k}$ and evaluate on the test dataset $\mathcal{D}_{Tk}$ using different evaluation measures $\operatorname{m}$ defined in Appendix B. To obtain a good estimate of the mean performance and an estimate for the standard deviation, we repeat this procedure $5$ times using outer cross-validation. For each fold $k\in[K],K=5$ , we get the evaluated value $a_{k}$ and calculate the mean and the standard deviation of the performance measure $\operatorname{m}$ .

Hyperparameters & Inference

We will now describe the specific hyperparameters we optimize and which ranges of values we consider (see Table 3 for an overview). For probabilistic models, we also describe how the inference is done. For all neural network models, we make use of the following techniques:

• We use either rectified linear unit (ReLU) non-linearities in conjunction with batch normalization (BN) [53] or self-normalizing linear unit (SELU) non-linearities [62] for each hidden layer.

• Regularization: $L_{2}$ penalties are applied and the corresponding regularization strength is tuned.

• Optimizer: stochastic gradient descent (SGD) with Nesterov momentum [79].

• A step-decay function is used for the learning rate annealing schedule. The decay factor is tuned [31].

The step-decay function drops the learning rate by a factor after a certain number of epochs [31]. Formally, it is defined as:

$$\mathit{lr}(e)=\mathit{lr}_{0}\cdot d_{r}^{\left\lfloor e/e_{\text{drop}}\right\rfloor},$$

where $\mathit{lr}_{0}$ is the initial learning rate, $0<d_{r}<1$ is the rate with which the learning rate should be reduced, $e$ is the current epoch and $e_{\text{drop}}$ is the number of epochs after which the learning rate is decreased. We set the maximum number of epochs the neural networks are trained for to $1000$ .
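A step-decay schedule of this kind can be sketched as below. The floor-based form is the common textbook variant and is an assumption here; the exact offset used in the experiments (for instance, whether the epoch count starts at $0$ or $1$) is not stated in this section, and the function and argument names are illustrative.

```python
import math

def step_decay(lr0, drop_rate, epoch, epochs_drop):
    """Learning rate after `epoch` epochs for a floor-based step decay.

    lr0         -- initial learning rate
    drop_rate   -- multiplicative factor d_r with 0 < d_r < 1
    epochs_drop -- number of epochs between two consecutive drops
    """
    return lr0 * drop_rate ** math.floor(epoch / epochs_drop)

# Learning rates for the first 30 epochs with a drop every 10 epochs.
schedule = [step_decay(0.1, 0.5, e, 10) for e in range(30)]
```

The schedule is piecewise constant: it stays at $\mathit{lr}_{0}$ for the first $e_{\text{drop}}$ epochs and is multiplied by $d_{r}$ at each subsequent drop.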

The hyperparameters of each algorithm were tuned using the package scikit-optimize [49]. Apart from the number of hidden layers and units, we also tune the learning rate of the stochastic gradient descent optimizer, the regularization strength and the batch size (fraction of training examples used for estimating the gradient in one iteration). For the neural networks, we also tune the drop-rate $d_{r}$ and the epoch-drop $e_{\text{drop}}$ of the step-decay function used by the stochastic gradient descent optimizer. For PairwiseSVM, we tune the penalty parameter $C$ of the error term and the tolerance for the stopping criterion of the optimization algorithm (tol in scikit-learn) [84]. All of the different GEV models are implemented in PyMC3, a library for facilitating Markov Chain Monte Carlo estimation of the posterior distribution [97]. An overview of all the hyperparameters and their admissible ranges is shown in Table 3.

Threshold Tuning

In order to set the threshold $t$ for the subset choice models (4), we tune it for all models on a small validation set. Obviously, an optimal value for $t$ will depend on the underlying target loss function. Our main target loss is the (micro-averaged) $F_{1}$-measure (21), which balances precision and recall of the predictions [66, 124, 121]. [64] show that tuning a threshold on a validation set yields a consistent classifier if the estimated marginal instance probabilities (in our case the choice probabilities) converge in probability to the population-level probabilities. One important difference to the multi-label classification setting is the absence of a fixed set of labels. Instead, we have a dynamically changing set of objects. Thus, it only makes sense to consider micro-averaged performance metrics.
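A minimal sketch of such a threshold search is given below. It assumes that an object is chosen whenever its predicted score exceeds $t$ and that candidate thresholds are taken from a fixed grid; both are illustrative assumptions, as are the function names. Micro-averaging is realized by pooling the confusion counts over all choice tasks before computing the $F_{1}$-measure, which is what allows tasks of different sizes to be mixed.

```python
import numpy as np

def micro_f1(scores, choices, threshold):
    """Micro-averaged F1 over choice tasks of possibly different sizes."""
    tp = fp = fn = 0
    for s, c in zip(scores, choices):
        pred = np.asarray(s, dtype=float) > threshold
        true = np.asarray(c, dtype=bool)
        tp += int(np.sum(pred & true))
        fp += int(np.sum(pred & ~true))
        fn += int(np.sum(~pred & true))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_threshold(scores, choices, candidates):
    """Return the candidate threshold with the highest micro-averaged F1."""
    return max(candidates, key=lambda t: micro_f1(scores, choices, t))

# Two validation tasks with 3 and 4 objects, respectively.
val_scores = [[0.9, 0.2, 0.6], [0.8, 0.1, 0.3, 0.7]]
val_choices = [[1, 0, 1], [1, 0, 0, 1]]
best_t = tune_threshold(val_scores, val_choices, candidates=[0.0, 0.5, 0.95])
```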

Appendix D Design of the Generalization Experiment

The second experimental setup is designed to gauge the generalization capability of the learning models by measuring the accuracy obtained by a trained model on unseen task set sizes. To this end, we vary the task set sizes from $3$ to $30$ as shown in Figure 10.

First, we configure the learning model with the best hyperparameters $p_{b}$ obtained from the empirical comparison experiment for the given dataset and the remaining default parameters $p_{d}$ . Then we generate the training dataset containing task sets of size $K=\lvert Q\rvert$ and train the configured model on the training dataset $\mathcal{D}_{K}$ .

Finally, we evaluate the trained model $CM$ on different test datasets $\mathcal{D}_{k}$ containing the task sets of sizes in $S$ ( $\lvert Q\rvert=k\in S$ ) as described in Table 4.

Appendix E Synthetic Datasets

In this section, we will formally describe the process of generating the datasets for the experimental evaluation. In the case of synthetic datasets, this entails the complete process by which the objects and queries are generated.

E.1 The Medoid Problem

Recall that we have defined the medoid of a set $Q\subset\mathbb{R}^{d}$ as $c_{\operatorname{medoid}}(Q)=\operatorname*{\arg\,\min}\nolimits_{\boldsymbol{x}\in Q}\frac{1}{|Q|}\sum\nolimits_{\boldsymbol{y}\in Q}\lVert\boldsymbol{x}-\boldsymbol{y}\rVert$ , where $\lVert\cdot\rVert$ is the standard euclidean norm in $\mathbb{R}^{d}$ . Thus, the medoid of $Q$ may be thought of as the most centrally located object in $Q$ , cf. the illustration of a choice set $Q$ of size $5$ and its medoid in Figure 11(a). As it depends on its distance to any other point from $Q$ , the medoid of $Q$ is sensitive to changes of any points in $Q$ .

For our empirical study, we created a dataset $\mathcal{D}=\left\{(Q_{1},C_{1}),\dotsc,(Q_{N},C_{N})\right\}$ by drawing each $Q_{i}$ independently and uniformly at random from the set

$$\left\{Q\subset[0,1]^{d}\,:\,\lvert Q\rvert=n\text{ and }Q\text{ has a unique medoid}\right\}$$

and then choosing $C_{i}\coloneqq c_{\operatorname{medoid}}(Q_{i})$. Here, the sampling step can be performed via the acceptance-rejection method: one may repeatedly sample $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}$ uniformly at random from $[0,1]^{d}$ until $Q=\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$ has size $n$ and a unique medoid. Since this condition is already fulfilled with probability $1$ after sampling $\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}$ only once, this method is efficient.
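The acceptance-rejection procedure for the Medoid data can be sketched as follows. The function names are hypothetical, and the check for a unique medoid uses exact floating-point equality, which is an assumption that suffices for illustration.

```python
import numpy as np

def medoid_indices(Q):
    """Indices of all points minimizing the mean Euclidean distance to Q."""
    dists = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)
    mean_dist = dists.mean(axis=1)
    return np.flatnonzero(mean_dist == mean_dist.min())

def sample_medoid_task(n, d, rng):
    """Draw n points from [0, 1]^d until they are distinct with a unique medoid."""
    while True:
        Q = rng.uniform(0.0, 1.0, size=(n, d))
        distinct = len(np.unique(Q, axis=0)) == n
        medoids = medoid_indices(Q)
        if distinct and len(medoids) == 1:
            return Q, int(medoids[0])

rng = np.random.default_rng(0)
Q, c = sample_medoid_task(n=5, d=2, rng=rng)
```

In practice, the loop body is executed only once, in line with the observation that the acceptance condition holds with probability $1$.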

E.2 The Pareto Problem

Above, we introduced the Pareto set $c_{\mathrm{Pareto}}(Q)$ of a set $Q\subset\mathbb{R}^{d}$ as the set of all elements $\boldsymbol{x}\in Q$ which are not dominated by any $\boldsymbol{y}\in Q\setminus\{\boldsymbol{x}\}$ , wherein $\boldsymbol{x}$ was said to dominate $\boldsymbol{y}$ if $\forall i\in[d]\colon x_{i}\leq y_{i}$ and $\exists i\in[d]\colon x_{i}<y_{i}$ . Figure 11(b) shows the Pareto set of a set $Q\subset\mathbb{R}^{2}$ .

With the help of Pareto sets we create a synthetic dataset $\mathcal{D}=\{(Q_{j},C_{j})\}_{j=1}^{N}$ for the subset choice task, where each sample $(Q,C)\in\mathcal{D}$ is generated independently of the others in the following way:

1. Sample $\boldsymbol{\mu}_{1},\dots,\boldsymbol{\mu}_{n}$ i.i.d. uniformly at random from $\{\boldsymbol{x}\in\mathbb{R}^{d}:\lVert\boldsymbol{x}\rVert\leq 1\}$.

2. Draw i.i.d. samples $\boldsymbol{\xi}_{1},\dots,\boldsymbol{\xi}_{n}$ from $N(\boldsymbol{0},\boldsymbol{I}_{d})$, the standard Gaussian distribution on $\mathbb{R}^{d}$, and define $\boldsymbol{x}_{i}\coloneqq\boldsymbol{\mu}_{i}+\boldsymbol{\xi}_{i}$ for each $i\in[n]$.

3. Choose $Q\coloneqq\{\boldsymbol{x}_{1},\dots,\boldsymbol{x}_{n}\}$ and $C\coloneqq c_{\mathrm{Pareto}}(Q)$.
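The three steps above can be sketched as follows. Sampling uniformly from the unit ball via a normalized Gaussian direction and a radius $u^{1/d}$ is a standard technique chosen here for illustration; the original implementation may have used a different method, and the function names are hypothetical. Dominance is evaluated in the minimization sense stated above.

```python
import numpy as np

def pareto_mask(Q):
    """Boolean mask of the points in Q that are not dominated by any other."""
    mask = np.ones(len(Q), dtype=bool)
    for i in range(len(Q)):
        # y dominates Q[i] if y <= Q[i] in all and y < Q[i] in some coordinate.
        dominates_i = np.all(Q <= Q[i], axis=1) & np.any(Q < Q[i], axis=1)
        mask[i] = not dominates_i.any()
    return mask

def sample_pareto_task(n, d, rng):
    """Generate one sample (Q, C) with C given as a boolean mask over Q."""
    # Step 1: centers uniformly from the unit ball.
    directions = rng.normal(size=(n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)
    mu = directions * radii
    # Step 2: add standard Gaussian noise to each center.
    X = mu + rng.normal(size=(n, d))
    # Step 3: the choice set is the Pareto set of the resulting points.
    return X, pareto_mask(X)

X, chosen = sample_pareto_task(n=10, d=2, rng=np.random.default_rng(0))
```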

Hypervolume

In Section 6.3.3 we have introduced for $Q\subset\mathbb{R}^{d}$ the choice set $c_{\mathrm{HypVol}}(Q)$ as the set of all $\boldsymbol{x}\in Q$ which contribute the most among all elements in $Q$ to the hypervolume of $Q$, cf. Section 6.3.3 for the precise definitions and also for the connection of the hypervolume of $Q$ to the Pareto front of $Q$. As the contribution of each point depends on the position of the other points in $Q$, $c_{\mathrm{HypVol}}$ is context-dependent. This is illustrated in Figure 11(b), where all five elements of $Q=\{A,B,C,D,E\}$ lie on the Pareto front of $Q$. There, the contribution of point $A$ is the largest in $Q$, but removing the point $D$ from the choice set increases the contribution of the point $E$. So, the singleton choice changes from $A$ to $E$ after removing $D$ from $Q$.

Based on $c_{\mathrm{HypVol}}$ we construct a singleton choice dataset $\mathcal{D}=\{(Q_{i},C_{i})\}_{i=1}^{N}$ by sampling each $Q_{i}$ uniformly at random from the set of all $Q\subseteq\mathbb{R}^{d}$ , which fulfill

[TABLE]

and then defining $C_{i}\coloneqq c_{\mathrm{HypVol}}(Q_{i})$ afterwards. As in the construction of the Medoid dataset, sampling can be done via the acceptance-rejection method.

E.3 MNIST Number Problems

In this section, we will describe the process of generating different semisynthetic datasets using the MNIST dataset [65].

Feature Extraction

Since the dataset consists of $2$-D image maps, we first train an off-the-shelf CNN to solve the digit multi-class classification task to level the playing field and abstract away from the computer vision context. The architecture of the CNN consists of $2$-D convolutional, $2$-D max-pooling, and fully-connected dense layers, as shown in Figure 12. We apply batch normalization, which subtracts the batch mean and divides by the batch standard deviation, to increase the stability of the network [43, 53].

The $2$-D convolutional layer has a kernel size of $5\times 5$ and uses the rectified linear unit (ReLU) non-linear activation function with $L_{2}$ regularization. It is followed by a $2$-D max-pooling layer with a filter of size $2\times 2$ applied with a stride of $2$, which down-samples the input by a factor of $2$ along both width and height, discarding $75\,\%$ of the activations by applying the max operation over the $4$ numbers in each $2\times 2$ region [43]. The output of these layers is provided as input to a fully-connected sequential network with $10$ outputs, where each output predicts the probability of the input image belonging to a particular class using the softmax [43]. We train this network on $10\,000$ instances and then transform the remaining $60\,000$ digits to a high-level feature representation by passing them through the trained CNN and recording the $128$ outputs of the last hidden layer (D2).

The transformed MNIST dataset $\mathcal{D}_{M}=\left\{(\boldsymbol{x}_{1},l_{1}),\ldots,(\boldsymbol{x}_{N},l_{N})\right\}$ is represented as a set of tuples $(\boldsymbol{x}_{i},l_{i})$, where $\boldsymbol{x}_{i}$ is the feature vector and $l_{i}$ represents the corresponding label, such that $\lvert\mathcal{D}_{M}\rvert=N=60\,000$, $\boldsymbol{x}_{i}\in\mathbb{R}^{128}$, $l_{i}\in\mathcal{L}=\left\{0,1,2,3,4,5,6,7,8,9\right\}$ and $\mathcal{D}_{M}(\boldsymbol{x}_{i})=l_{i}$ holds for all $i\in[N]$. For constructing the choice datasets, we sample instances $(\boldsymbol{x}_{i},l_{i})\in\mathcal{D}_{M}$ from the transformed dataset uniformly at random to construct a task set $Q=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\}$. Based on $Q$ and $\boldsymbol{l}=(\mathcal{D}_{M}(\boldsymbol{x}_{1}),\ldots,\mathcal{D}_{M}(\boldsymbol{x}_{n}))$, we then select as choice set $C=g(Q,\boldsymbol{l})$, where $g$ is an appropriately predefined function. We consider two variants for $g$, namely $g_{\mathrm{unique}}$ and $g_{\mathrm{mode}}$.

The function $g_{\mathrm{unique}}$ outputs the instances corresponding to the numbers which occur only once in the label vector. For example, $g_{\mathrm{unique}}(Q,(4,3,2,3,3,1,8,8,7,7))=\left\{\boldsymbol{x}_{1},\boldsymbol{x}_{3},\boldsymbol{x}_{6}\right\}$, corresponding to the numbers $4$, $2$ and $1$. For singleton choice, we sample only the task sets whose corresponding label vector $\boldsymbol{l}$ contains a single unique number, to make it identifiable, i. e., for example $g_{\mathrm{unique}}(Q,(4,3,3,2,2,1,1,1,5,5,5))=\left\{\boldsymbol{x}_{1}\right\}$. The second function is $g_{\mathrm{mode}}$, which outputs the instances corresponding to the number which occurs most frequently in the label vector. For example, $g_{\mathrm{mode}}(Q,(4,3,2,3,3,8,8,7,7,7))=\left\{\boldsymbol{x}_{8},\boldsymbol{x}_{9},\boldsymbol{x}_{10}\right\}$, corresponding to the mode $7$. For singleton choice, we choose the instance corresponding to the mode which is at the least angle from a predefined weight vector $\boldsymbol{w}$.

Both functions used to generate choices depend on all other objects in the given task set $Q$ , thus making the datasets highly context-dependent.
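Both choice functions depend only on the label vector, so they can be sketched compactly. The following illustration returns $0$-based positions in the task set rather than the objects themselves, and it keeps all labels that are tied for the highest frequency; both conventions, as well as the Python function names, are assumptions made for this sketch.

```python
from collections import Counter

def g_unique(labels):
    """Positions of the objects whose label occurs exactly once."""
    counts = Counter(labels)
    return [j for j, l in enumerate(labels) if counts[l] == 1]

def g_mode(labels):
    """Positions of the objects whose label occurs most frequently."""
    counts = Counter(labels)
    top = max(counts.values())
    return [j for j, l in enumerate(labels) if counts[l] == top]

unique_positions = g_unique((4, 3, 2, 3, 3, 1, 8, 8, 7, 7))
mode_positions = g_mode((4, 3, 2, 3, 1, 8, 8, 7, 7, 7))
```

For the first label vector, the positions $0$, $2$ and $5$ are returned, i. e., the objects $\boldsymbol{x}_{1}$, $\boldsymbol{x}_{3}$ and $\boldsymbol{x}_{6}$ in the $1$-based notation used above. Removing or adding a single object can change the counts and hence the choice, which is what makes these tasks context-dependent.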

Unique

In this subsection, we explain the data generation process for the Unique choice dataset using the $g_{\mathrm{unique}}$ function defined above. For generating the dataset, we select a set of instances from $\mathcal{D}_{M}$ uniformly at random to construct the task set $Q$ and the label vector $\boldsymbol{l}$. Then we choose the objects from $Q$ which correspond to the unique digits in the label vector $\boldsymbol{l}$ (an example is shown in Figure 11(c)). Let us assume we want to generate a dataset $\mathcal{D}=\left\{(Q_{i},C_{i})\right\}_{i=1}^{N}$ with $N$ instances.

1. Sample $n$ data points $(\boldsymbol{x}_{i,1},l_{1}),\dotsc,(\boldsymbol{x}_{i,n},l_{n})$ from $\mathcal{D}_{M}$, let $\boldsymbol{l}_{i}\coloneqq(l_{1},\dots,l_{n})$ and $Q_{i}\coloneqq\{\boldsymbol{x}_{i,1},\dots,\boldsymbol{x}_{i,n}\}$.

2. For each $l\in\mathcal{L}$ let $k_{l}$ be the number of times the label $l$ appears in the label vector $\boldsymbol{l}_{i}$ for $Q_{i}$, define $\boldsymbol{k}\coloneqq(k_{0},\dots,k_{9})$ and write for convenience $\boldsymbol{k}(l)\coloneqq k_{l}$ in the following. For example, for $\boldsymbol{l}=(1,2,4,4,4,5,5)$ we have $\boldsymbol{k}=(0,1,1,0,3,2,0,0,0,0)$.

3. We create $C_{i}$ by selecting the objects whose labels occur only once in the label vector $\boldsymbol{l}_{i}$:

$$C_{i}\coloneqq\left\{\boldsymbol{x}_{i,j}\in Q_{i}\,:\,\boldsymbol{k}(l_{j})=1\right\}$$

4. In order to create the corresponding singleton choice or top-$1$ version of this dataset, we discard $Q_{i}$ in case $\lvert C_{i}\rvert>1$ and repeat steps 1–4. If $\lvert C_{i}\rvert=1$ instead, we keep the sample $(Q_{i},C_{i})$.

Mode

In this subsection, we explain the data generation process for the Mode choice dataset using the $g_{\mathrm{mode}}$ function defined above. For generating the dataset, we select a set of instances from $\mathcal{D}_{M}$ uniformly at random to construct the task set $Q$ and the label vector $\boldsymbol{l}$ (an example is shown in Figure 11(c)). Then we choose the objects from $Q$ which correspond to the mode value of the label vector $\boldsymbol{l}$ to construct the ground-truth set of chosen objects. For creating the corresponding singleton choice or top-$1$ dataset, we choose the object corresponding to the mode value of the label vector which is at the least angle to the predefined weight vector $\boldsymbol{w}$. Let us assume we want to generate a dataset $\mathcal{D}=\left\{(Q_{i},C_{i})\right\}_{i=1}^{N}$ with $N$ instances. First, we sample the weight vector $\boldsymbol{w}\in\mathbb{R}^{128}$ from $N(\boldsymbol{0},\boldsymbol{I}_{128})$.

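1. Sample $n$ data points $(\boldsymbol{x}_{i,1},l_{1}),\dots,(\boldsymbol{x}_{i,n},l_{n})$ uniformly at random from $\mathcal{D}_{M}$, abbreviate $\boldsymbol{l}_{i}\coloneqq(l_{1},\dots,l_{n})$ and let $Q_{i}\coloneqq\{\boldsymbol{x}_{i,1},\dots,\boldsymbol{x}_{i,n}\}$.

2. As for the Unique dataset, write $k_{l}$ for the number of times the label $l$ appears in the label vector $\boldsymbol{l}_{i}$ for $Q_{i}$, define $\boldsymbol{k}\coloneqq(k_{0},\dots,k_{9})$ and write again $\boldsymbol{k}(l)\coloneqq k_{l}$.

3. For the case of subset choice, define

$$C_{i}\coloneqq\left\{\boldsymbol{x}_{i,j}\in Q_{i}:\boldsymbol{k}(l_{j})=\max_{l\in\mathcal{L}}\boldsymbol{k}(l)\right\}$$

and in the case of singleton choice, select $C_{i}$ to be the set that contains only the mode object with the least angle to the vector $\boldsymbol{w}$, i.e.,

$$C_{i}\coloneqq\Bigl\{\operatorname*{arg\,min}_{\boldsymbol{x}_{i,j}\in Q_{i}\,:\,\boldsymbol{k}(l_{j})=\max_{l\in\mathcal{L}}\boldsymbol{k}(l)}\ \arccos\frac{\boldsymbol{w}^{\top}\boldsymbol{x}_{i,j}}{\lVert\boldsymbol{w}\rVert\,\lVert\boldsymbol{x}_{i,j}\rVert}\Bigr\}$$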
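The Mode construction can be sketched as follows. This is an illustration under stated assumptions rather than the generation code used for the experiments: the names `mode_choice`, `angle` and `mode_singleton` are hypothetical, all labels tied for the highest count are treated as modes, feature vectors are assumed to be non-zero, and objects are again identified by their positions.

```python
import math
from collections import Counter
from typing import List, Sequence


def mode_choice(labels: Sequence[int]) -> List[int]:
    """Positions whose label attains the maximal count (ties between labels all count)."""
    counts = Counter(labels)
    top = max(counts.values())
    return [j for j, l in enumerate(labels) if counts[l] == top]


def angle(x: Sequence[float], w: Sequence[float]) -> float:
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    # Clamp to [-1, 1] to guard against rounding errors before acos.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def mode_singleton(features: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   w: Sequence[float]) -> int:
    """Among the mode objects, return the position with the least angle to w."""
    return min(mode_choice(labels), key=lambda j: angle(features[j], w))
```

Restricting the angle comparison to the mode objects matters: an object outside the mode may be better aligned with $\boldsymbol{w}$ but must not be chosen.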

E.4 Tag Genome Dataset

The GroupLens Research group released many datasets collected from the MovieLens website (https://movielens.org/) for research in the field of recommender systems [47]. As of August 2017, the full dataset collected from this website consists of $26\,000\,000$ ratings and $750\,000$ tags applied to $45\,000$ movies by $270\,000$ users [47]. One of the datasets is the Tag Genome dataset (available at https://grouplens.org/datasets/movielens/), which provides real-valued features to characterize the movies [117].

Tags are metadata in the form of keywords, which help to describe an object (such as a movie, a piece of music, or a book). In recent years, tagging has gained popularity due to the growth of social networking websites and web search engines [105]. On the MovieLens website, users create tags to describe a movie. Other users can then use them to filter movies more effectively. Users can also gain more information about a movie with the help of tags applied by other users.

The Tag Genome dataset was generated by applying machine learning algorithms to the information provided by users for a movie in the form of tags, reviews, and ratings [118]. It consists of movies, a set of tags applied to each of them, and a score between $0$ and $1$ quantifying the relevance of each tag to the particular movie (as shown in Figure 13). Currently, this dataset consists of around 12 million relevance scores across $1128$ tags applied to $10\,993$ movies.

Framework

According to [117] the Tag Genome dataset consists of:

1. $M$: The set of movies $\left\{m_{1},\ldots,m_{N_{m}}\right\}$, where $\lvert M\rvert=N_{m}=10993$.

2. $T$: The set of tags $T=\left\{t_{1},\ldots,t_{N_{t}}\right\}$, where $\lvert T\rvert=N_{t}=1128$.

3. $R_{rel}:M\times T\longrightarrow[0,1]$: Relation such that $R_{rel}(m_{i},t_{j})$ denotes the degree to which the tag $t_{j}\in T$ applies to the movie $m_{i}\in M$ on a scale of $0$ to $1$; here $0$ indicates no relevance and $1$ indicates strong relevance to the movie (as shown in Figure 13).

4. $\mathcal{M}_{f}:M\longrightarrow[0,1]^{N_{t}}$: Relation mapping each movie to its feature vector in tag-space (vector of tag relevance values across all tags), such that $\mathcal{M}_{f}(m_{i})=\boldsymbol{x}_{i}\coloneqq(R_{rel}(m_{i},t_{1}),\dots,R_{rel}(m_{i},t_{N_{t}}))$.

5. $\text{tag-pop}:T\longrightarrow\mathbb{N}$: Function representing the popularity of a tag, measured as the number of users who applied the tag $t_{j}\in T$.

6. $\text{tag-spec}:T\longrightarrow\mathbb{N}$: Function representing the movie frequency of tag $t_{j}\in T$, i.e., $\text{tag-spec}(t_{j})\coloneqq\sum_{m_{i}\in M}\llbracket R_{rel}(m_{i},t_{j})>0.5\rrbracket$ denotes the number of movies for which the relevance of tag $t_{j}$ is greater than $0.5$.

7. $P$: The set of the $20$ most popular tags $P\subset T$ based on the popularity tag-pop.
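To make these definitions concrete, the following sketch evaluates them on a toy relevance matrix. The matrix values, the popularity counts and the function names (`feature_vector`, `tag_spec`, `most_popular`) are made up for illustration and do not come from the Tag Genome data.

```python
from typing import List

# Toy relevance matrix R_rel: one row per movie, one column per tag (made-up values).
relevance: List[List[float]] = [
    [0.9, 0.1, 0.7],
    [0.8, 0.6, 0.2],
    [0.3, 0.4, 0.6],
]


def feature_vector(movie: int) -> List[float]:
    """M_f(m_i): the vector of tag relevances of movie i."""
    return relevance[movie]


def tag_spec(tag: int) -> int:
    """Number of movies whose relevance for the tag exceeds 0.5."""
    return sum(1 for row in relevance if row[tag] > 0.5)


def most_popular(tag_pop: List[int], top: int = 20) -> List[int]:
    """Indices of the `top` most popular tags, most popular first."""
    order = sorted(range(len(tag_pop)), key=lambda t: tag_pop[t], reverse=True)
    return order[:top]
```

With the real data, `relevance` would have $10\,993$ rows and $1128$ columns, and `most_popular` would be called with the default `top=20` to obtain $P$.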

The weighted cosine similarity is a measure defined in [117] to quantify the similarity between two movies. The weight vector $\boldsymbol{w}$ assigns more weight to popular tags, because popularity implies that more users care about these tags, and to more specific tags, because they identify similar movies more reliably. For example, two movies that have the harry potter tag in common are more likely to be similar than two movies that have the tag fantasy in common [117]. A $\log$-transform is applied to both values to bring them closer to the normal distribution. The weighted cosine similarity between two movies $m_{i}$ and $m_{j}$ is defined as:

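$$\operatorname{sim}(m_{i},m_{j})\coloneqq\frac{\sum_{k=1}^{N_{t}}w_{k}\,x_{i,k}\,x_{j,k}}{\sqrt{\sum_{k=1}^{N_{t}}w_{k}\,x_{i,k}^{2}}\;\sqrt{\sum_{k=1}^{N_{t}}w_{k}\,x_{j,k}^{2}}}$$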

where $\boldsymbol{x}_{i}=\mathcal{M}_{f}(m_{i}),\enspace\boldsymbol{x}_{j}=\mathcal{M}_{f}(m_{j})$ and $w_{k}=\frac{\log(\text{tag-pop}(t_{k}))}{\log(\text{tag-spec}(t_{k}))}$ for any $t_{k}\in T$ .
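A minimal sketch of this similarity is given below. It assumes the usual form of a weighted cosine, in which every tag's contribution to both the inner product and the two norms is scaled by $w_{k}$; the function names are hypothetical. The weights require $\text{tag-pop}(t_{k})>1$ and $\text{tag-spec}(t_{k})>1$ so that both logarithms are positive.

```python
import math
from typing import List, Sequence


def tag_weights(tag_pop: Sequence[int], tag_spec: Sequence[int]) -> List[float]:
    """w_k = log(tag-pop(t_k)) / log(tag-spec(t_k)); both counts must exceed 1."""
    return [math.log(p) / math.log(s) for p, s in zip(tag_pop, tag_spec)]


def weighted_cosine(x_i: Sequence[float], x_j: Sequence[float],
                    w: Sequence[float]) -> float:
    """Cosine similarity in which tag k contributes with weight w_k."""
    num = sum(wk * a * b for wk, a, b in zip(w, x_i, x_j))
    norm_i = math.sqrt(sum(wk * a * a for wk, a in zip(w, x_i)))
    norm_j = math.sqrt(sum(wk * b * b for wk, b in zip(w, x_j)))
    return num / (norm_i * norm_j)
```

A tag applied by many users but relevant to few movies receives a large weight, so agreement on such a tag raises the similarity more than agreement on a broad tag.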

To construct the singleton choice semisynthetic dataset, we sample uniformly at random $n$ movie items from $M$ to create a task set $Q$ , and we choose the medoid $\boldsymbol{r}$ of $Q$ as the reference movie.

We define two tasks based on the reference movie $\boldsymbol{r}$ of the sampled task set $Q$ . The first task is to choose the most similar movie to the reference movie in task set $Q$ . The second task is to choose the most dissimilar movie with respect to the reference movie $\boldsymbol{r}$ for a given task set $Q$ . This problem is similar to finding the outliers for a given set of objects which can be used to solve the problem of anomaly detection [1, 21]. Both tasks used to generate semisynthetic datasets depend on the similarity between all objects in the given task set $Q$ , thus making the datasets highly context-dependent.

Data Generation Process

We explain the data generation process for the Tag Genome Similar Movie and Tag Genome Dissimilar Movie datasets. Let us assume we want to generate a singleton choice dataset $\mathcal{D}=\left\{(Q_{i},C_{i})\right\}_{i=1}^{N}$ with $N$ instances. Each task set $Q_{i}$ and its corresponding singleton choice set $C_{i}$ are constructed in the following way:

1. Sample i.i.d. and uniformly at random $m_{1},\dots,m_{n}$ from $M$, let $\boldsymbol{x}_{i,j}\coloneqq\mathcal{M}_{f}(m_{j})$ for each $j\in[n]$ and $Q_{i}\coloneqq\{\boldsymbol{x}_{i,1},\dots,\boldsymbol{x}_{i,n}\}$.

2. Compute the reference object (movie) $\boldsymbol{r}$ for $Q_{i}$ (medoid):

[TABLE]

3. Now we define the corresponding singleton choices $C_{1},\dotsc,C_{N}$ for the Tag Genome Similar Movie and Tag Genome Dissimilar Movie datasets.

(a) The singleton choice set $C_{i}$ for $Q_{i}$ for Tag Genome Dissimilar Movie is the set consisting of only that element of $Q_{i}$ which is most dissimilar to $\boldsymbol{r}$, i.e., formally

[TABLE]

(b) For the Tag Genome Similar Movie dataset, we select for the task $Q_{i}$ the singleton choice set

[TABLE]

which consists of the one element from $Q_{i}$ that is most similar to $\boldsymbol{r}$.
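The steps above can be sketched in code under two assumptions that are not confirmed by the text: the medoid is taken to be the element with the largest total similarity to all elements of the task set, and the reference movie itself is excluded when searching for the most similar and most dissimilar movie. The similarity function is passed in as a parameter, and all names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]


def medoid_index(task: List[Vector], sim: Similarity) -> int:
    """Position of the element with the largest total similarity (assumed medoid)."""
    totals = [sum(sim(x, y) for y in task) for x in task]
    return max(range(len(task)), key=lambda j: totals[j])


def similar_and_dissimilar(task: List[Vector], sim: Similarity) -> Tuple[int, int]:
    """Positions of the most similar and most dissimilar element w.r.t. the medoid."""
    r = medoid_index(task, sim)
    others = [j for j in range(len(task)) if j != r]  # assumption: r is excluded
    most_similar = max(others, key=lambda j: sim(task[j], task[r]))
    most_dissimilar = min(others, key=lambda j: sim(task[j], task[r]))
    return most_similar, most_dissimilar
```

Because both targets depend on the medoid, which in turn depends on every element of the task set, replacing a single movie in $Q_{i}$ can change the chosen movie. This is the source of the context-dependence of these datasets.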

Appendix F Real-World Datasets

Some widely used benchmark datasets available for solving this task are LETOR and SUSHI [89, 58]. In the following sections, we briefly describe these datasets and the process we use to generate singleton and subset choice datasets.

F.1 LETOR Datasets

LETOR (version 4.0) is a package of benchmark datasets released by Microsoft Research Asia, which are used to compare and evaluate different learning algorithms in the field of preference learning [89]. We use the datasets MQ $2007$ and MQ $2008$, released for learning the task of partial ranking, to create the subset choice dataset. To create the singleton choice dataset, we use the datasets MQ $2007$ -list and MQ $2008$ -list, released for learning the task of complete ranking (these datasets are available at https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/).

LETOR Supervised Datasets

The datasets (MQ $2007$ and MQ $2008$) consist of queries and retrieved documents, with individual preferences in the form of a relevance label for each document with respect to the corresponding query [89]. The format of both datasets is the same, and there are about $1500$ queries in MQ $2007$ and about $500$ in MQ $2008$ with labelled documents. Each object, called a query-document pair, consists of $46$ features extracted from a query and a document, and each pair is labelled with a relevance score in $\left\{0,1,2\right\}$, indicating how relevant the document is to the respective query as shown in Figure 14(a). A relevance score of $0$ means that the document is not relevant, $1$ means relevant and $2$ means very relevant to the query. For this dataset, the goal of the choice problem is to choose all the relevant documents for the given task.

Structure

The dataset consists of a universal set of objects $\boldsymbol{x}\in\mathcal{X}$. Each of these datasets $\mathcal{D}_{S}=\left\{(\tilde{Q}_{1},\boldsymbol{l}_{1}),\ldots,(\tilde{Q}_{N},\boldsymbol{l}_{N})\right\}$ is represented as a set of tuples $(\tilde{Q}_{i},\boldsymbol{l}_{i})$, where $\tilde{Q}_{i}=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\}$ is the task set (each $\boldsymbol{x}_{j}$ is the feature vector extracted from a query-document pair) and $\boldsymbol{l}_{i}=(l_{1},\ldots,l_{n})$ is the vector of relevance labels for the given set of objects, such that $\boldsymbol{x}_{j}\in\mathbb{R}^{46},\enspace l_{j}\in\left\{0,1,2\right\}$ for all $j\in[n]$ and $5\leq\lvert\tilde{Q}_{i}\rvert\leq 147$ for every $i\in[N]$.

The size of the universal set of objects in the MQ $2007$ dataset is $59\,570$ , i. e., $\lvert\mathcal{X}\rvert=59570$ and the MQ $2008$ dataset is $564$ , i. e., $\lvert\mathcal{X}\rvert=12102$ . These datasets have been partitioned into $5$ parts by [89], such that $\mathcal{D}_{S}=\mathcal{D}_{S1}\cup\mathcal{D}_{S2}\cup\mathcal{D}_{S3}\cup\mathcal{D}_{S4}\cup\mathcal{D}_{S5}$ . This partition is used to conduct $5$ -fold cross-validation, and for each fold, we use four parts for training and the remaining part for testing as described in Table 5.

Choice Data Conversion

The corresponding choice dataset is created by considering the documents in $\tilde{Q}_{i}$ as the task set $Q_{i}$ and the set of relevant documents $C_{i}\coloneqq\left\{\boldsymbol{x}_{j}\in\tilde{Q}_{i}:l_{j}\in\left\{1,2\right\}\right\}$ as the corresponding choice set for each instance $(\tilde{Q}_{j},\boldsymbol{l}_{j})\in\mathcal{D}_{S}\setminus\mathcal{D}_{Si}$. For training the choice model, we sub-sample $10$ objects from each query instance $\tilde{Q}_{i}$ to construct the task sets. Note that we still evaluate the models on the corresponding test choice dataset, which consists of all original queries for each fold as described in Table 5.
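The conversion and the sub-sampling step can be sketched as follows. How the $10$ objects are drawn is not specified above, so the sketch assumes uniform sampling without replacement and keeps queries with at most $10$ documents unchanged. The function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def relevant_positions(labels: Sequence[int]) -> List[int]:
    """Positions of documents with relevance label 1 or 2 (the choice set)."""
    return [j for j, l in enumerate(labels) if l in (1, 2)]


def subsample_instance(features: Sequence[Sequence[float]],
                       labels: Sequence[int],
                       size: int = 10,
                       rng: Optional[random.Random] = None
                       ) -> Tuple[List[Sequence[float]], List[int], List[int]]:
    """Draw `size` documents uniformly (assumption) and rebuild the choice set."""
    rng = rng or random.Random()
    n = len(labels)
    keep = list(range(n)) if n <= size else sorted(rng.sample(range(n), size))
    sub_features = [features[j] for j in keep]
    sub_labels = [labels[j] for j in keep]
    return sub_features, sub_labels, relevant_positions(sub_labels)
```

The choice set is recomputed on the sub-sample, so a training task set may contain fewer relevant documents than the original query, or none at all.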

LETOR Listwise Datasets

The format of both listwise datasets is the same as that of the supervised ones. There are about $1700$ queries in MQ $2007$ -list and about $800$ queries in MQ $2008$ -list, with each query-document pair consisting of $46$ features. In these datasets, all the documents for each query are labelled with a real-valued relevance score instead of the multi-level relevance judgments, as shown in Figure 14(b). The documents at the top positions of the ground-truth permutation have larger relevance scores.

Structure

The dataset consists of a universal set of objects $\boldsymbol{x}\in\mathcal{X}$. Each of these datasets $\mathcal{D}_{L}=\left\{(\tilde{Q}_{1},\boldsymbol{l}_{1}),\ldots,(\tilde{Q}_{N},\boldsymbol{l}_{N})\right\}$ is represented as a set of tuples $(\tilde{Q}_{i},\boldsymbol{l}_{i})$, where $\tilde{Q}_{i}=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\}$ is the task set (each $\boldsymbol{x}_{j}$ is the feature vector extracted from a query-document pair) and $\boldsymbol{l}_{i}=(l_{1},\ldots,l_{n})$ is the vector of relevance scores for the given set of objects, such that $\boldsymbol{x}_{j}\in\mathbb{R}^{46},\enspace l_{j}\in\mathbb{R}$ for all $j\in[n]$ and $204\leq\lvert\tilde{Q}_{i}\rvert\leq 1831$ for every $i\in[N]$.

Singleton Choice Data Conversion

The corresponding singleton choice datasets are created by considering the documents in $\tilde{Q}_{i}$ as the task sets $Q_{i}$ and the most relevant document $C_{i}=\left\{\operatorname*{\arg\,\max}_{\boldsymbol{x}_{j}\in\tilde{Q}_{i}}l_{j}\right\}$ as the corresponding singleton choice set for each instance $(\tilde{Q}_{j},\boldsymbol{l}_{j})\in\mathcal{D}_{L}\setminus\mathcal{D}_{Li}$ . For training the SCM we sub-sample $10$ objects from each query instance $\tilde{Q}_{i}$ to construct the task sets. Note that we still evaluate the models on the corresponding singleton choice test dataset, which consists of all original queries for each fold as described in Table 5.
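A sketch of the singleton conversion with sub-sampling is given below. It assumes that the singleton choice of a sub-sampled task set is the best-scored document among the sampled ones, which is one plausible reading of the procedure; the sampling scheme (uniform, without replacement) and the function names are likewise assumptions.

```python
import random
from typing import List, Optional, Sequence, Tuple


def top1(scores: Sequence[float]) -> int:
    """Position of the largest relevance score (the first one on ties)."""
    return max(range(len(scores)), key=lambda j: scores[j])


def subsampled_top1(scores: Sequence[float],
                    size: int = 10,
                    rng: Optional[random.Random] = None) -> Tuple[List[int], int]:
    """Sample `size` documents and return them with the best-scored one among them."""
    rng = rng or random.Random()
    n = len(scores)
    keep = list(range(n)) if n <= size else sorted(rng.sample(range(n), size))
    winner = top1([scores[j] for j in keep])
    return keep, keep[winner]
```

Since the listwise queries contain between $204$ and $1831$ documents, the globally best document is rarely part of a sample of size $10$, so the training target generally differs from the choice on the full query.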

F.2 Expedia Hotel Dataset

Expedia released a dataset on the Kaggle website as a competition and for research purposes (the dataset is available at https://www.kaggle.com/c/expedia-personalized-sort/data). The dataset includes browsing and booking data as well as information on price competitiveness. The data are organized around a set of search result impressions, i.e., the ordered list of hotels that users see after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the dataset contains impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel and/or a purchase of a hotel room. This dataset consists of $399\,344$ search queries, and each object consists of $45$ features extracted from the search query and the hotel. Each hotel is labelled with a relevance score of $0$, $1$ or $2$, indicating how relevant the hotel is to the respective query or the user. A relevance score of $0$ means that the hotel was not clicked, $1$ means it was clicked and $2$ means the hotel was booked by the user. This dataset is very similar to the LETOR dataset, as shown in Figure 14. For this dataset, we define the learning target to be the set of relevant hotels (clicked and/or booked). Since the number of hotels displayed differs from query to query, this dataset consists of different task sizes.

Structure

The dataset consists of a universal set of objects $\boldsymbol{x}\in\mathcal{X}$. The dataset $\mathcal{D}_{E}=\left\{(\tilde{Q}_{1},\boldsymbol{l}_{1}),\ldots,(\tilde{Q}_{N},\boldsymbol{l}_{N})\right\}$ is represented as a set of tuples $(\tilde{Q}_{i},\boldsymbol{l}_{i})$, where $\tilde{Q}_{i}=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\}$ is the task set (each $\boldsymbol{x}_{j}$ is the feature vector extracted for a hotel) and $\boldsymbol{l}_{i}=(l_{1},\ldots,l_{n})$ is the vector of relevance labels for the given set of objects, such that $\boldsymbol{x}_{j}\in[-1,\infty]^{45},\enspace l_{j}\in\left\{0,1,2\right\}$ for each $j\in[n]$ and $5\leq\lvert\tilde{Q}_{i}\rvert\leq 38$ for all $i\in[N]$.

The number of instances $N$ in this dataset is $399\,344$, i.e., $\lvert\mathcal{D}_{E}\rvert=399344$, and the size of the universal set of objects (hotels) is $136\,886$, i.e., $\lvert\mathcal{X}\rvert=136886$. There are $31$ features with missing values, and we removed the features with more than $50\,\%$ missing values. For the remaining $3$ features with missing values, we impute the missing entries with a negative value less than $-1$. The models are trained on the resulting dataset with $17$ features.
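The feature cleaning can be sketched as below. Missing entries are represented by `None`, and the fill value `-2.0` is an assumption standing in for the unspecified negative value less than $-1$; the function name `drop_and_impute` is hypothetical.

```python
from typing import List, Optional, Tuple

Row = List[Optional[float]]


def drop_and_impute(rows: List[Row],
                    max_missing: float = 0.5,
                    fill_value: float = -2.0) -> Tuple[List[List[float]], List[int]]:
    """Drop columns with more than `max_missing` missing entries, impute the rest."""
    n_rows, n_cols = len(rows), len(rows[0])
    # Fraction of missing entries per column.
    missing = [sum(1 for r in rows if r[c] is None) / n_rows for c in range(n_cols)]
    kept = [c for c in range(n_cols) if missing[c] <= max_missing]
    cleaned = [[fill_value if r[c] is None else r[c] for c in kept] for r in rows]
    return cleaned, kept
```

Choosing a fill value below the feature range $[-1,\infty]$ keeps imputed entries distinguishable from observed ones. With the counts stated above, $28$ of the $45$ features are removed and $3$ are imputed, which leaves the $17$ features used for training.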

Data Conversion Process

We create $5$ folds by shuffle-splitting the dataset randomly into $80\,\%$ test and $20\,\%$ train instances. The choice dataset is created by considering the hotels in $\tilde{Q}_{i}$ as the task set $Q_{i}$ and the set of relevant hotels $C_{i}\coloneqq\{\boldsymbol{x}_{j}\in\tilde{Q}_{i}:l_{j}\in\left\{1,2\right\}\}$ as the corresponding choice set for each instance $(\tilde{Q}_{i},\boldsymbol{l}_{i})\in\mathcal{D}_{E}$. The models are trained on the sampled training dataset and evaluated on the corresponding test dataset using $5$-fold stratified cross-validation as described in Table 7.

Singleton Choice

In order to create the singleton choice dataset, we just consider the samples where the user booked the hotel, which is the singleton choice for the given query. The singleton choice dataset is created by considering the hotels in $\tilde{Q}_{i}$ as the task set $Q_{i}$ and the set of booked hotels $C_{i}=\bigl{\{}\boldsymbol{x}_{j}\in\tilde{Q}_{i}:l_{j}=2\bigr{\}}$ as the corresponding choice set for each instance $(\tilde{Q}_{i},\boldsymbol{l}_{i})\in\mathcal{D}_{E}$ .

The models are trained on the sampled training dataset and evaluated on the corresponding test dataset using $5$-fold stratified cross-validation as described in Table 7. Note that instances without any booking were discarded; only instances with a booking were considered.
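The difference between the subset and the singleton conversion of the Expedia data can be sketched as follows, with label vectors standing in for the queries. The function names are hypothetical, and the sketch keeps all booked positions of a query without assuming that there is exactly one.

```python
from typing import Dict, List, Sequence


def clicked_or_booked(labels: Sequence[int]) -> List[int]:
    """Subset choice: positions of hotels with label 1 (clicked) or 2 (booked)."""
    return [j for j, l in enumerate(labels) if l in (1, 2)]


def booked(labels: Sequence[int]) -> List[int]:
    """Singleton choice: positions of hotels with label 2 (booked)."""
    return [j for j, l in enumerate(labels) if l == 2]


def singleton_instances(label_vectors: Sequence[Sequence[int]]) -> Dict[int, List[int]]:
    """Map query index to booked positions, skipping queries without a booking."""
    result: Dict[int, List[int]] = {}
    for i, labels in enumerate(label_vectors):
        positions = booked(labels)
        if positions:
            result[i] = positions
    return result
```

Queries that only contain clicks contribute to the subset choice dataset but are dropped from the singleton choice dataset, so the two datasets differ in size.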

F.3 SUSHI Dataset

SUSHI (available at http://www.kamishima.net/sushi/) is another dataset released for the task of object ranking. It was collected by surveying $5000$ individuals, each of whom was provided with two item sets $A$ and $B$. Set $A$ consists of the $10$ most famous sushi and $B$ consists of the top $100$ sushi famous in Japan. Individuals were asked to provide their preferences in the form of a total order for the items in set $A$, and a real-valued score between $0$ and $5$ for the sushi in set $B$. There were missing rating values for many items in set $B$, so they extracted the total order for the top $10$ preferred items by each user.

The SUSHI dataset consists of a universal set of objects $\boldsymbol{x}\in\mathcal{X}$ of size $100$, i.e., $\lvert\mathcal{X}\rvert=100$, with $10\,000$ sets of objects $Q$ of size $10$; each sushi is described by $7$ features, i.e., $\boldsymbol{x}\in\mathbb{R}^{7}$. The instances of the dataset $\mathcal{D}_{S}=\left\{(Q_{1},\pi_{1}),\ldots,(Q_{N},\pi_{N})\right\}$ are represented as a set of tuples $(Q_{i},\pi_{i})$, where $Q_{i}=\{\boldsymbol{x}_{1},\ldots,\boldsymbol{x}_{n}\}$ is the set of objects and $\pi_{i}$ represents the underlying ordering of the given set of objects $Q_{i}$, such that $N=\lvert\mathcal{D}_{S}\rvert=10000$, $\boldsymbol{x}_{i}\in\mathbb{R}^{7}$ and $\lvert Q_{i}\rvert=10$ holds for all $i\in[N]$.

The dataset contains the following features:

1. Style: A binary feature describing whether the sushi is a maki or not, where $0$ means maki sushi and $1$ means other.

2. Major group: A binary feature describing whether the sushi is listed as seafood ($0$) or not ($1$).

3. Minor group: The species group used to prepare the sushi. The group is denoted by a categorical value between $0$ and $11$, i.e., it lies in the set $\left\{0,1,2,3,4,5,6,7,8,9,10,11\right\}$. Refer to Table 8 for a description of each group.

4. Oiliness/Heaviness: The amount of oil or fat present in the sushi, expressed as a real number between $0$ and $4$, where $0$ indicates heavy/oily and $4$ oil-free.

5. Demand: The frequency with which the user demands the sushi, expressed as a real number between $0$ and $3$, where $3$ means most frequently and $0$ not at all.

6. Normalized Price: The price of the sushi normalized over the given $100$ sushi.

7. Supply: The frequency with which the sushi is sold in the shop, expressed as a real number between $0$ and $1$, where $0$ indicates not at all and $1$ frequently.

Singleton Choice Data Conversion

To use the SUSHI dataset in the singleton choice setting, we reuse the sets of objects $Q$ in $\mathcal{D}_{S}$ and choose the most preferred object as the singleton choice. We created the singleton choice dataset $\mathcal{D}_{SDC}=\left\{(Q_{1},C_{1}),\dotsc,(Q_{N},C_{N})\right\}$ with $N=\lvert\mathcal{D}_{S}\rvert$ instances, such that $\lvert Q_{k}\rvert=10$ and $C_{k}\coloneqq\left\{\boldsymbol{x}_{\pi_{k}(1)}\right\}$ for all $k\in[N]$. The singleton choice models are evaluated on $5$ folds created by train-test shuffle-split with 80 % train and 20 % test instances.
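The conversion and the evaluation splits can be sketched as follows. The sketch assumes that an ordering is stored as a list of object positions from most to least preferred, and that each of the $5$ splits is an independent random permutation of the instances; both the representation and the function names are assumptions.

```python
import random
from typing import List, Sequence, Tuple


def top1_from_ranking(ranking: Sequence[int]) -> List[int]:
    """`ranking` lists object positions from most to least preferred."""
    return [ranking[0]]


def shuffle_splits(n_instances: int,
                   n_splits: int = 5,
                   test_fraction: float = 0.2,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return `n_splits` independent (train, test) index splits."""
    rng = random.Random(seed)
    n_test = round(n_instances * test_fraction)
    splits = []
    for _ in range(n_splits):
        order = list(range(n_instances))
        rng.shuffle(order)
        # The first n_test shuffled indices form the test set, the rest the train set.
        splits.append((order[n_test:], order[:n_test]))
    return splits
```

Unlike the folds of a cross-validation, the test sets of different shuffle-splits may overlap, because every split is drawn independently.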

Appendix G Detailed Experimental Results

The following Tables 9, 10 and 11 contain all experimental results as discussed in Section 6.4 in numeric form for additional evaluation measures.

Bibliography


[1] Charu C Aggarwal and Philip S Yu “Outlier detection for high dimensional data” In ACM SIGMOD Record 30.2, 2001, pp. 37–46 ACM
[2] Qingyao Ai, Keping Bi, Jiafeng Guo and W. Croft “Learning a Deep Listwise Context Model for Ranking Refinement” In SIGIR ACM, 2018, pp. 135–144
[3] Qingyao Ai et al. “Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks” In ICTIR ACM, 2019, pp. 85–92
[4] Pavel Anselmo Alvarez, Alessio Ishizaka and Luis Martínez “Multiple-criteria decision-making sorting methods: A survey” In Expert Systems with Applications 183, 2021, pp. 115368
[5] Attila Ambrus and Kareen Rozen “Rationalising Choice with Multi‐Self Models” In The Economic Journal 125.585, 2014, pp. 1136–1156 DOI: 10.1111/ecoj.12103
[6] Kenneth J Arrow “Social Choice and Individual Values” John Wiley & Sons, 1951
[7] Richard R. Batsell and John C. Polking “A New Class of Market Share Models” In Marketing Science 4.3 INFORMS, 1985, pp. 177–198 URL: http://www.jstor.org/stable/183903
[8] Peter W. Battaglia et al. “Relational inductive biases, deep learning, and graph networks” In CoRR abs/1806.01261, 2018
