Learning Context-Dependent Choice Functions
Karlson Pfannschmidt, Pritha Gupta, Björn Haddenhorst, Eyke Hüllermeier

TL;DR
This paper introduces models for learning context-dependent choice functions using neural networks, addressing challenges like variable input size and order invariance, with extensive empirical validation on synthetic and real data.
Contribution
It proposes a novel framework for modeling context-dependent preferences via utility functions and develops neural network architectures to learn these functions effectively.
Findings
Neural network models outperform baselines on synthetic datasets.
Models demonstrate strong generalization to real-world choice data.
Approaches handle variable input sizes and order invariance effectively.
| Problem | Dataset | # Train | # Test | # Features |
|---|---|---|---|---|
| Singleton Choice | Medoid | | | |
| Singleton Choice | Hypervolume | | | |
| Singleton Choice | MNIST-Mode | | | |
| Singleton Choice | MNIST-Unique | | | |
| Singleton Choice | Tag Genome Dissimilar Movie | | | |
| Singleton Choice | Tag Genome Similar Movie | | | |
| Singleton Choice | LETOR-MQ-list | | | |
| Singleton Choice | LETOR-MQ-list | | | |
| Singleton Choice | Expedia | | | |
| Singleton Choice | Sushi | | | |
| Subset Choice | Pareto-front-D | | | |
| Subset Choice | Pareto-front-D | | | |
| Subset Choice | MNIST-Mode | | | |
| Subset Choice | MNIST-Unique | | | |
| Subset Choice | LETOR-MQ | | | |
| Subset Choice | LETOR-MQ | | | |
| Subset Choice | Expedia | | | |

Major Group

| Value | Species | Value | Species | Value | Species |
|---|---|---|---|---|---|
| | Aomono (blue-skinned fish) | | Clam or shell | | Other seafood |
| | Akami (red meat fish) | | Squid or octopus | | Egg |
| | Shiromi (white-meat fish) | | Shrimp or crab | | Meat other than fish |
| | Tare (something like baste for eel) | | Roe | | Vegetables |

[TABLE: Accuracy and two Top-k scores (values of k missing) of each SCM (FETA-Net, FATE-Net, FETA-Linear, SDA, RankNet, PairwiseSVM, MNL, NL, GNL, ML) on the datasets Medoid, Hypervolume, MNIST-Unique, MNIST-Mode, Tag Genome Similar Movie, Tag Genome Dissimilar Movie, LETORMQ-list (two variants), Expedia, and SUSHI; the numeric entries are missing.]
Karlson Pfannschmidt [ID](https://orcid.org/0000-0001-9407-7903)
Pritha Gupta [ID](https://orcid.org/0000-0002-7277-4633)
Björn Haddenhorst [ID](https://orcid.org/0000-0002-4023-6646)
Eyke Hüllermeier [ID](https://orcid.org/0000-0002-9944-4108)
Abstract
Choice functions accept a set of alternatives as input and produce a preferred subset of these alternatives as output. We study the problem of learning such functions under conditions of context-dependence of preferences, which means that the preference in favor of a certain choice alternative may depend on what other options are also available. In spite of its practical relevance, this kind of context-dependence has received little attention in preference learning so far. We propose a suitable model based on context-dependent (latent) utility functions, thereby reducing the problem to the task of learning such utility functions. Practically, this comes with a number of challenges. For example, the set of alternatives provided as input to a choice function can be of any size, and the output of the function should not depend on the order in which the alternatives are presented. To meet these requirements, we propose two general approaches based on two representations of context-dependent utility functions, as well as instantiations in the form of appropriate end-to-end trainable neural network architectures. Moreover, to demonstrate the performance of both networks, we present extensive empirical evaluations on both synthetic and real-world datasets.
Keywords: preference learning · choice functions · context-dependence · neural networks
1 Introduction
The notion of preference plays a central role in various scientific disciplines, such as economics, psychology, and more recently also computer science and artificial intelligence [19]. In these fields, mathematical formalisms have been developed for modelling and reasoning about preferences, and for analyzing data that originates from observed or revealed preferences. In this regard, choice observations are of specific interest, in which a subset of “good” alternatives is selected from a set of available candidates. In particular, starting with the seminal work by [6], choice functions have been analyzed as a key concept of a formal theory of choice and preference. The study of pairwise preferences even goes back to work by [36], who considered the varying perception of different stimuli.
In machine learning, preferences are at the core of preference learning, which has received increasing attention in recent years [40]. Roughly speaking, the goal in preference learning is to learn (predictive) preference models from preference data. Somewhat surprisingly, and in spite of a close connection between ranking and choice, the problem of learning subset choice functions has received very little attention so far, with only a few notable exceptions [10, 109]. In this paper, we therefore address the problem of learning choice functions, which express preferences in terms of subsets (or equivalently, bipartitions) of . From a machine learning point of view, the problem of learning choice functions comes with a number of challenges. For example, while algorithms for supervised learning normally assume inputs in the form of feature vectors of fixed length, the inputs in our setting are neither vectors nor of fixed size. Instead, a choice function is supposed to accept inputs in the form of sets of any size, and to return a subset (choice) of the elements as output. In case a set is represented by an ordered list of its elements, a choice function thus has to be invariant with respect to permutations of its input.
No less interestingly, and in fact the key motivation of this paper, choice functions could be context-dependent, in the sense that the preference in favor of an alternative may depend on what other options are available. Context-dependence of this kind has been observed, for example, in marketing studies [26, 13], and has been investigated systematically in fields like economics and psychology. More specifically, three major context effects have been identified in the literature: the compromise effect [103], the attraction effect [52], and the similarity effect [113]:
- The compromise effect states that the relative utility of an object increases by adding an extreme option that makes it a compromise in the set of alternatives [94]. For instance, consider the set of objects in Figure 1(a). The ordering of these objects depends on how much the consumer is weighing the quality of the product in relation to its price. If price is the main constraint, then the preference order will be . But as soon as another extreme option becomes available, object may be considered more favorable, because it represents a compromise between the three alternatives. Thus, the preference relation between and might get inverted and turned into .
- Figure 1(b) illustrates the attraction effect. Here, if we add another object to the set of objects , where is slightly dominated by , the relative utility share for object increases with respect to . The major psychological reason is that consumers have a strong preference for dominating products [52]. Thus, the preference relation between and may again be influenced.
- The similarity or substitution effect is another phenomenon, according to which the presence of similar objects tends to reduce the overall probability of an object to be chosen, as it will divide the loyalty of potential consumers [52]. In Figure 1(c), and are two similar objects. Consumers who prefer high quality will be divided amongst the two objects, resulting in a decrease of the relative utility share of object . Again, this may lead to turning a preference into , at least on an aggregate (population) level, if preferences are defined on the basis of choice probabilities.
Context-dependence as explained above has received only limited consideration in the machine learning literature until recently [22, 85, 10, 101, 95, 15, 63].
Additionally, the context effects discussed so far are those observed in human choice behavior, but the space of (subset) choice functions, and thus the number of possible applications, is much larger. Many algorithmic problems can be framed as choice problems. For example, in the knapsack problem, one is tasked with choosing a set of maximal utility while obeying capacity constraints. Computing the medoid of a set of points (i. e., the point with minimal total distance to all other points) is a singleton choice problem. Clearly, these problems cannot be solved by considering each choice alternative individually; the complete choice context needs to be incorporated. In practice, there are many abstract choice problems similar to these, e. g., portfolio selection [72], algorithm selection [91, 14], and team selection [119], to name just a few. All these problems have in common that context-dependence arises naturally because the output depends jointly on all objects in the set, and not because a decision maker behaves rationally or irrationally.
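To make the medoid example concrete, the following minimal sketch picks the point with the smallest summed distance to all other points. The function name `medoid_choice`, the Euclidean metric, and the toy data are assumptions made here for illustration. The sketch shows that removing alternatives from the task can change which of the remaining points is chosen, so no context-independent score per point can reproduce this choice.

```python
import numpy as np

def medoid_choice(points: np.ndarray) -> int:
    """Index of the point with minimal summed Euclidean distance to all others."""
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.linalg.norm(diffs, axis=-1)      # pairwise distance matrix
    return int(np.argmin(distances.sum(axis=1)))

# Hypothetical one-dimensional objects, one row per object.
task = np.array([[0.0], [1.0], [5.0], [6.0], [7.0]])

full_choice = medoid_choice(task)          # the object at 5.0 is chosen
reduced_choice = medoid_choice(task[:3])   # without 6.0 and 7.0, the object at 1.0 is chosen
print(task[full_choice, 0], task[reduced_choice, 0])
```

Both 1.0 and 5.0 are available in both tasks, yet the choice between them flips once the two points at 6.0 and 7.0 are removed.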
Motivated by its practical relevance, we formalize the problem of learning context-dependent choice functions. To this end, we provide a formal definition of such functions and propose a data-generating process consisting of two stages: First, choice alternatives are scored in terms of latent utility degrees, and then, a choice set is determined on the basis of these scores (Section 3). Based on this model, we propose two representations of the latent (context-dependent) utility, called First Evaluate Then Aggregate (FETA) and First Aggregate Then Evaluate (FATE), which have appealing properties from a learning point of view (Section 4), as well as realizations of these models in terms of neural network architectures (Section 5). Thanks to these architectures, called FETA-Net and FATE-Net [85], we are able to learn subset choices on sets of objects in an end-to-end trainable manner. To demonstrate the performance of both networks, we present extensive empirical evaluations on both synthetic and real-world choice datasets (Section 6). Additional information and supplementary material are provided in an appendix, to which we will refer occasionally.
2 Related Literature
The problem of how to model preferences in general has been extensively studied from different viewpoints in the past. From an axiomatic/normative perspective, one posits which properties have to hold for preferences to be considered “rational,” and studies consequences of these properties. Luce’s choice axiom was introduced in 1959 by [67] and requires that the preference between two items does not depend on the presence or absence of any other choice alternative, a property commonly referred to as independence of irrelevant alternatives (IIA). The set of objects from which a particular preference is observed is also called the context [77, 20, 94], and thus preferences obeying IIA are also called context-independent [60]. In the same year, [27, pp. 56f] proved the ordinal representation theorem, which shows that preferences can be represented by a continuous utility function, if certain conditions including transitivity are assumed to hold. A related line of research was concerned with the concept of revealed preferences, for which most axioms can be reduced to some notion of transitivity [98, 50, 100].
On the other side of the spectrum, observational studies in economics and psychology were more concerned with how humans actually behave, and studied how the observed behavior deviates from IIA [114, 113, 51, 52, 103, 82, 104, 102, 115, 30, 83, 99]. It was consistently observed that choice behavior depends on the specific collection of alternatives available, i. e., the context of the choice. [94] and [92] provide an extensive overview of the different context effects which were identified over the years and which we already showcased in the introduction. This motivated researchers to come up with methods able to model these violations. Classical random utility models (RUMs), like the multinomial logit (MNL) model, are not able to take these effects into account. Therefore, extensions of RUMs were proposed, which are able to capture the compromise and attraction effect [115, 61, 80], the similarity effect [113, 56] or all of the above [94]. One important line of research focuses on the assumption that the decision maker chooses based on multiple utility functions (so-called “multiple selves”, or “multi-self” for short), which are suitably aggregated. This setting has been studied in economics [73, 55, 61, 39, 71, 45] and psychology [113, 102, 115]. Continuing this line of research, [5] show that by utilizing a collection of context-independent utility functions, combined with a suitable aggregation, one is able to model arbitrary choice functions. That is, choice behavior across multiple sets can be modelled even though it might violate context-independence.
While traditional research on preferences, as discussed above, is mostly of a normative, prescriptive or descriptive nature, the advent of machine learning triggered a shift towards “predictive” models. [95] build on ideas of the multi-self literature and propose to learn set-dependent weights and embeddings, which are then linearly combined to arrive at an aggregated score for each object. [10] consider the problem of learning preferences in the form of subsets of objects. To this end, they extend the classical multinomial logit model to account for violations of context-independence. Higher-order interactions between objects are added specifically for those subsets that cause a violation. The set of objects for which choices or choice sets are observed is assumed to be fixed. Therefore, the approach cannot be used for arbitrary task sets, where it can happen that an object is only observed once. Our approach to decompose a context-dependent utility function into an aggregation across smaller sub-contexts has been a recent, promising direction in studying choices [85, 101], and will be the focus of this paper.
Decomposition approaches have also been employed in the related field of “learning to rank”. [2] employ a context-independent model to pre-sort the objects, while a recurrent neural network is used in a subsequent step to fine-tune the ranking. The FATE approach, introduced in the context of choice by [85], obviates the need to pre-sort the objects, by directly embedding each object to produce a representation for each set of objects (aggregation), which is then used as the context to produce the final ranking (evaluation). The authors also introduce an algorithm where this order is swapped, called FETA, in which each object is scored in the context of another object first, and only then the scores are aggregated to produce a final ranking. [3] later consider a similar decomposition, where higher order interactions are approximated by employing sampling.
3 A Probabilistic Model of Choice
We start by establishing the necessary notation (refer to Appendix A for an overview). Throughout this paper, is defined to be 1 if is a true statement, and 0 otherwise. We will denote by a set of reference objects serving as choice alternatives, which, for simplicity, we assume to be finite (albeit of arbitrary size), if not explicitly stated otherwise. An object or item is represented by a vector of features . A non-empty subset of is called a choice task space if and any is called a choice task. A choice for is a non-empty subset of and the set of choices for any is called the choice space.
We say that a function is a (subset) choice function (for ) if is fulfilled for any , and in case holds for any , is called a singleton choice function (for ). A typical example for a real-world singleton choice function is when a user enters a query in a search engine and receives a list of results () of which they pick one and click on. Subset choice functions usually occur, when a diverse set of objects is sought, e. g., a search engine decides on a set of the most relevant, but diverse, results to display to the user.
As common in machine learning, the input-output dependency of interest, in our case between tasks and choices, is not assumed to be deterministic. Instead, we assume a probabilistic dependence, which is captured by a (conditional) probability distribution on the non-empty subsets of for every . Here, is interpreted as the probability to observe the choice given the task . For the sake of convenience, we suppose w.l.o.g. to be extended to via for any . Moreover, we write for short for . In case is the latent probability that is given as task, the whole data-generating process is modelled by the joint distribution
[TABLE]
on .
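Read together with the preceding sentences, the joint distribution presumably factorizes into the probability of the task and the conditional probability of the choice given the task. A sketch in assumed notation follows; the symbols $Q$ for a task, $C$ for a choice, and $\mathcal{Q}$ for the choice task space are chosen here for illustration and may differ from the authors' symbols.

```latex
\[
  \mathbf{P}(Q, C) \;=\; \mathbf{P}(Q)\,\mathbf{P}(C \mid Q),
  \qquad Q \in \mathcal{Q},\quad \emptyset \neq C \subseteq Q .
\]
```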
We call the choice probabilities context-independent if
[TABLE]
is fulfilled for every and any with . Conversely, we say that a system of choice distributions is context-dependent, if this equality is violated on at least one pair of . This definition extends in a straight-forward and consistent way the notion of independence of irrelevant alternatives (IIA) introduced by [6], which was originally only defined for the case of singleton choice, in which consists of elements of size one only. We choose to use the more general term of context-(in)dependence, for the simple reason that the notion of “irrelevant” alternatives is rather tailored to the analysis of human choices but less meaningful in our more general setting of arbitrary choice functions.
As an example, consider the knapsack problem, where the goal is to select a set of objects which maximize a certain utility, while obeying capacity constraints. It is clear that the decision on which object to include in the choice set needs to incorporate the complete choice task context, and that one is not able to ascertain the relative choice probability of two alternatives while ignoring all others. As already explained in the introduction, context-independence is often violated in practice. This motivates the development of context-dependent learning methods.
Utility-Based Choices
We propose to model choices as the result of a two-stage process (cf. Figure 2 for an overview), grounding them on the notion of utility: In the first stage, each object in a given task is assigned a real-valued utility score. Then in the second stage, choices are generated based on these scores.
Utility theory has a long history in economics [120, 25, 73]. Originally introduced as a way to measure the satisfaction achieved by a certain alternative [11], it is nowadays common in decision theory to consider utility more as an abstract value that ought to be maximized by any rational decision maker [120, 96]. This is formalized by means of a generalized utility function (for )
[TABLE]
which allows for modelling the utility of an object as a function of both, properties of the object itself as well as properties of other choice alternatives in , which constitute the context in which is considered: expresses a degree of utility of in the context , i. e., given the availability of other choice alternatives . The score is supposed to capture an abstract notion of utility, which in turn reflects the propensity of to be chosen in any task .
We call a utility function context-independent in case holds for any with and context-dependent otherwise. Via abbreviating for some arbitrary with , any context-independent utility function may be thought of as a function .
Moving on to the second stage, based on a utility function , one may define in a deterministic manner for the corresponding singleton choice as
[TABLE]
and for the subset choice (with threshold ) as
[TABLE]
Clearly, and are in fact choice functions and in case is injective (i. e., there are no ties), for any , the former one is a singleton choice function. There is an interesting connection to social choice theory, where a social choice rule is employed to select an outcome out of a set of possible outcomes in order to maximize some notion of utility for a population of individuals with possibly varying utility functions. The injectivity of such a social choice rule is called resoluteness and it is an important property considered in social choice theory, where it also plays a role in several impossibility results [59, 81]. The singleton choice is a special case of the more general top- choice, where the goal is to select the best objects. It differs from subset choice in so far that the size of the choice sets is always fixed, whereas in subset choice it can vary. The top- choice setting has strong connections to the ranking setting, which we will discuss below.
Further note that using thresholding to convert a set of scores into a partition is a standard approach in multi-label classification [64] and multi-criteria sorting [4].
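The two deterministic choice rules can be sketched as follows, assuming the utility scores of a task have already been computed. The function names, the default threshold of 0.5, and the fallback to the best object when no score exceeds the threshold are assumptions made here so that the returned choice is never empty; they are not taken from the text.

```python
import numpy as np

def singleton_choice(utilities: np.ndarray) -> int:
    """Choose the object with the highest utility (ties go to the first index)."""
    return int(np.argmax(utilities))

def subset_choice(utilities: np.ndarray, threshold: float = 0.5) -> list:
    """Choose all objects whose utility exceeds the threshold."""
    chosen = np.flatnonzero(utilities > threshold).tolist()
    # Assumption: fall back to the single best object so the choice is non-empty.
    return chosen if chosen else [singleton_choice(utilities)]

scores = np.array([0.2, 0.9, 0.6])   # hypothetical utilities of three objects
print(singleton_choice(scores))      # 1
print(subset_choice(scores))         # [1, 2]
print(subset_choice(scores, 0.95))   # [1]
```

The size of the subset choice varies with the scores, whereas a top-k rule would always return exactly k objects.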
In the probabilistic setting, the utility function may serve to model probabilistic choices , , on by using the utility scores as the corresponding parameters of the distributions. Certainly, there are various ways in which this idea could be realized:
Singleton choice
In the case of singleton choice, a natural assumption is the multinomial logit (MNL) model, in which for any and ,
[TABLE]
and for any of size [12, 46, 24, 70, 108]. Note here that these choice probabilities are context-independent, if is context-independent. An important special case is the Bradley-Terry-Luce model [16], which only considers pairwise comparisons (i. e., for all ).
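As a small numerical illustration, the sketch below assumes the usual softmax form of the MNL model, in which the probability of choosing an object is proportional to the exponential of its utility. The helper name `mnl_probabilities` and the toy utilities are hypothetical. With context-independent utilities, the ratio of the choice probabilities of two objects is the same in every task that contains both.

```python
import numpy as np

def mnl_probabilities(utilities) -> np.ndarray:
    """Softmax over the utilities of the objects in one task."""
    z = np.asarray(utilities, dtype=float)
    z = z - z.max()                  # shift for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()

# Hypothetical context-independent utilities of three objects.
u = {"a": 1.0, "b": 0.5, "c": -0.3}

p_small = mnl_probabilities([u["a"], u["b"]])
p_large = mnl_probabilities([u["a"], u["b"], u["c"]])

ratio_small = p_small[0] / p_small[1]
ratio_large = p_large[0] / p_large[1]
print(ratio_small, ratio_large)      # both equal exp(u["a"] - u["b"])
```

If the utilities were allowed to depend on the task, the two ratios could differ, which is exactly the violation of context-independence discussed above.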
Subset choice
For the choice of arbitrary subsets (not limited to singleton sets), a simple model is obtained by treating the inclusion or exclusion of each object in a task as independent given the utilities. This results in the distributions given by
[TABLE]
for any non-empty and , where is a constant such that holds. If is context-independent, the quantity
[TABLE]
does not depend on , and thus the choice probabilities are context-independent as well.
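The independent-inclusion model can be sketched by enumerating all non-empty subsets of a small task. The sketch assumes that utilities lie strictly between 0 and 1 and act as inclusion probabilities, and that the constant simply renormalizes over the non-empty subsets; both are assumptions made here, as are the function name and the toy values.

```python
from itertools import combinations
import numpy as np

def subset_choice_distribution(utilities) -> dict:
    """Probability of every non-empty subset under independent inclusion."""
    u = np.asarray(utilities, dtype=float)   # assumed to lie in (0, 1)
    n = len(u)
    weights = {}
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            mask = np.zeros(n, dtype=bool)
            mask[list(subset)] = True
            # Included objects contribute u, excluded objects contribute 1 - u.
            weights[subset] = float(np.prod(u[mask]) * np.prod(1.0 - u[~mask]))
    normalizer = sum(weights.values())       # equals 1 minus the empty-set weight
    return {s: w / normalizer for s, w in weights.items()}

dist = subset_choice_distribution([0.9, 0.2, 0.5])
print(len(dist), sum(dist.values()))
```

Enumeration is exponential in the task size, so this is only meant to make the definition tangible on toy tasks.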
Choices based on rankings
Yet another type of model is obtained by assuming that, based on the latent utilities , , a ranking on is sampled first and then turned into a choice set via a (possibly probabilistic) procedure afterwards. The probability is then simply the probability that this procedure results in the output , i. e.,
[TABLE]
where the sum is taken over all possible rankings over . An approach of that kind might be appealing, because probability distributions on rankings have been studied quite thoroughly in the literature. Important families of ranking distributions include distance-based ranking models [37], of which the Mallows model [69] is a popular instance, and multistage ranking models [38], most prominently represented by the Plackett-Luce distribution [86]. An important special case for is top- choice, where the first objects are chosen deterministically (i. e., holds with probability for any ranking ). This can be generalized, for example, by assuming that the size is not fixed but random. An even more general model has recently been proposed by [34], where choices are not necessarily restricted to top- sets.
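A ranking-based choice can be sketched in two steps: sample a ranking, then cut it off after the first k positions. The sketch below uses the stagewise sampling scheme of the Plackett-Luce distribution and assumes that the exponentiated utilities serve as its parameters; this parameterization, the function names, and the toy utilities are assumptions for illustration.

```python
import numpy as np

def sample_plackett_luce(utilities, rng) -> list:
    """Sample a ranking by repeatedly drawing the next object without replacement."""
    u = np.asarray(utilities, dtype=float)
    remaining = list(range(len(u)))
    ranking = []
    while remaining:
        weights = np.exp(u[remaining])
        probs = weights / weights.sum()
        pick = int(rng.choice(len(remaining), p=probs))
        ranking.append(remaining.pop(pick))
    return ranking

def top_k_choice(ranking, k) -> set:
    """Deterministically choose the first k objects of a ranking."""
    return set(ranking[:k])

rng = np.random.default_rng(0)
ranking = sample_plackett_luce([2.0, 0.5, -1.0], rng)
choice = top_k_choice(ranking, k=2)
print(ranking, choice)
```

Drawing k at random instead of fixing it would give the generalization with a random choice-set size mentioned above.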
In this paper, we are mainly interested in tackling the problem of learning context-dependent choice functions from training data. The performance of a particular hypothesis, i. e., a choice function , is measured by an appropriate loss function (see Section 4). In Section 6.2 we go into more detail on how to derive suitable loss functions from (5) and (6). After having introduced suitable models for utility-based choices, we now turn to the problem of representing context-dependent choice functions.
4 Learning Context-Dependent Choice Functions
Our main interest in this paper is to tackle choice from a machine learning perspective. More specifically, we seek to induce a predictive choice function from training data in the form of exemplary tasks together with observed choices . The performance of such a function is measured in terms of its expected loss (risk)
[TABLE]
where is a loss function (cf. Section 6.2 for an overview of the loss functions we consider), and the probability measure associated with the distribution (1), i. e., the underlying data-generating process modelling the probability of observing tasks together with choices . The Bayes predictor assigns each task the respective loss minimizer
[TABLE]
Since is usually unknown, one therefore opts to minimize the empirical risk
[TABLE]
on the given data instead.
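The empirical risk is simply the average loss of a candidate choice function over the observed pairs of task and choice. The sketch below uses a 0/1 loss on whole choice sets, which is an assumption made here for brevity and not necessarily one of the losses considered in Section 6.2; the candidate choice function and the data are hypothetical.

```python
def zero_one_subset_loss(true_choice, predicted_choice) -> float:
    """0 if the predicted choice set matches the observed one exactly, else 1."""
    return 0.0 if set(true_choice) == set(predicted_choice) else 1.0

def empirical_risk(choice_function, data, loss=zero_one_subset_loss) -> float:
    """Average loss of choice_function over (task, observed choice) pairs."""
    losses = [loss(choice, choice_function(task)) for task, choice in data]
    return sum(losses) / len(losses)

def pick_largest(task):
    """Hypothetical candidate: always choose the largest number in the task."""
    return {max(task)}

# Hypothetical training data: tasks of different sizes with observed choices.
data = [((1, 5, 3), {5}), ((2, 4), {2})]
risk = empirical_risk(pick_largest, data)
print(risk)   # 0.5
```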
Assuming the data to be generated according to one of (3)–(7) (known to the learner) and by means of an (unknown) latent utility function (2), this loss minimization problem essentially comes down to learning the generalized utility function (2). This function, while allowing one to model context-dependence, causes several practical problems, mainly because its second argument, , is a set of variable size.
Many machine learning models such as neural networks or support vector machines require data to be given in the form of a feature vector . Hence, in order to apply such a model for learning a utility function , we have to fix an injective feature transformation .
We choose to represent by the vector . Of course, this only defines a valid transformation if is the same for each . Assuming this to be the case, we may consider a utility function as a function . Noticing that holds for any bijection , this function should necessarily be permutation-invariant or symmetric in the sense that
[TABLE]
for each permutation on [106].
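For small contexts, permutation invariance can be checked by brute force. In the sketch below, a utility takes an object and an ordered list of the other objects; the checker and the two toy utilities are hypothetical and only illustrate the property.

```python
from itertools import permutations

def is_permutation_invariant(utility, x, context, tol=1e-9) -> bool:
    """Check that utility(x, context) is the same for every ordering of context."""
    reference = utility(x, list(context))
    return all(
        abs(utility(x, [context[i] for i in order]) - reference) <= tol
        for order in permutations(range(len(context)))
    )

def mean_gap(x, context):
    """Symmetric: depends on the context only through its mean."""
    return x - sum(context) / len(context)

def first_gap(x, context):
    """Not symmetric: depends on which object happens to be listed first."""
    return x - context[0]

print(is_permutation_invariant(mean_gap, 2.0, [1.0, 3.0, 4.0]))    # True
print(is_permutation_invariant(first_gap, 2.0, [1.0, 3.0, 4.0]))   # False
```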
The utility choice models proposed below will enforce this property and are also capable of dealing with tasks of different sizes. More specifically, we present two general decompositions, which are able to approximate a generalized latent utility function (2). Section 4.1 describes FETA, which decomposes (2) into first- and second-order (or, more generally, higher-order) utility functions and aggregates the corresponding scores into an overall utility score. The FATE approach (Section 4.2), on the other hand, first computes an embedding of the complete object context in a space of fixed dimensionality, and evaluates the utility of each object in that space. The former could be advantageous for datasets in which the choice task contexts can be expressed through local interactions, while the latter is useful if the set of objects as a whole can be summarized by suitable global properties (e. g., choosing the element of a set that is closest to the centroid of all elements in this set).
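The centroid example can be written as a hand-coded instance of the aggregate-then-evaluate idea: the task is first summarized by a fixed-size vector, and each object is then scored against that summary. In the sketch below, the mean is used as the aggregation and the negative distance as the evaluation; both are fixed by hand here for illustration, whereas the architectures discussed later learn these steps from data.

```python
import numpy as np

def fate_style_utilities(task: np.ndarray) -> np.ndarray:
    """Score each object (row) by its closeness to the centroid of the task."""
    context = task.mean(axis=0)                      # aggregate: fixed-size summary
    return -np.linalg.norm(task - context, axis=1)   # evaluate each object against it

# Hypothetical task with three two-dimensional objects.
task = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 0.1]])
scores = fate_style_utilities(task)
choice = int(np.argmax(scores))
print(choice)   # 2, the object closest to the centroid
```

Because the summary has the same dimensionality for every task and does not depend on the order of the rows, the same function applies to tasks of any size and is symmetric in the sense defined above.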
4.1 First Evaluate Then Aggregate
Recall that the overall objective is to model the context-dependent utility function (2), i. e., the utility of each object should not only depend on object attributes, but also on the choice task . One way of handling the problem of rating objects in contexts of variable size is to decompose a context into sub-contexts of a fixed size [85, 101]. More specifically, the idea is to learn sub-utility functions of the form and
[TABLE]
for , and represent the original function (2) as an aggregation
[TABLE]
where is the average over the values for subsets of consisting of distinct elements, i. e., formally
[TABLE]
Note that the sum is taken over all -sized subsets of , potentially including some in . Here, may be thought of as a measure of the extent to which an item is preferred to the elements of , and as an indicator of how much is on average preferred to distinct elements from . We refer to this approach as First Evaluate Then Aggregate (FETA), because an alternative is first evaluated in each sub-context, and these evaluations are then aggregated. Accordingly, we call defined in (10) the FETA utility function with sub-utility functions and denote it by .
[7] propose a related expansion in the context of market share modelling. [101] call it an instantiation of the universal logit model, since it can be seen as a generalization of the multinomial logit model (5), when conditioning on the task .
Roughly speaking, the motivation behind the above decomposition is that dependencies and interaction effects between objects should only occur up to a certain order , or at least can be limited to this order without losing too much information. To see what we mean by “order” in this context, observe that the first order model () reduces to and thus only models the inherent utility of each object. A second order model () then introduces pairwise terms. This is an assumption that is commonly made in the literature on aggregation functions [44]. The reason why the utilities are averaged for a fixed , but summed across different , is to give each order equal weight. This prevents the utility from being dominated by higher-order interactions. Furthermore, it allows the sub-utility functions to output scores in roughly the same scale, which is advantageous when the model is applied to choice tasks of varying size.
Given the models of context-dependent choices as outlined above, the learning problem essentially comes down to learning the utility function (10) of order . From this function, one can then derive the utility function (2), which in turn allows for deriving predictions of choices via the choice functions discussed before.
In this paper, we realize (10) for the special case , which can be seen as a second-order approximation of a context-dependent utility function. Thus, we propose the representation of a choice function based on a latent sub-utility function and a pairwise function . In this way, the FETA utility function with sub-utility functions may be written as
[TABLE]
The value can be seen as a kind of inherent, context-independent utility of , whereas the scores , , serve as “corrections” of this utility in the context of the task .
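The second-order decomposition can be illustrated with a minimal sketch. It assumes that (11) adds the inherent utility of an object to the average of its pairwise scores against all other objects of the task, as the averaging described above suggests. The functions `u0` and `u1` are hypothetical stand-ins chosen for illustration, not learned models.

```python
import numpy as np

def feta_utilities(Q, u0, u1):
    """Second-order FETA utilities for all objects of one task.

    Q  : array of shape (n, d), one row per object (n >= 2)
    u0 : callable mapping an object to its context-independent utility
    u1 : callable mapping (x, y) to the score of x in the presence of y
    """
    n = len(Q)
    scores = np.empty(n)
    for i in range(n):
        # Average the pairwise corrections over all other objects in the task.
        pairwise = [u1(Q[i], Q[j]) for j in range(n) if j != i]
        scores[i] = u0(Q[i]) + np.mean(pairwise)
    return scores

# Hypothetical sub-utilities, chosen only for illustration.
u0 = lambda x: float(x.sum())
u1 = lambda x, y: -float(np.linalg.norm(x - y))

Q = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
scores = feta_utilities(Q, u0, u1)
# Presenting the objects in reverse order yields the same utilities per object.
permuted = feta_utilities(Q[::-1], u0, u1)[::-1]
```

Because every object is compared with all remaining objects, the function accepts tasks of any size of at least two, and the resulting utilities do not depend on the order of presentation.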
[101] propose a similar approximation, but instead of averaging the task context, the authors simply sum up all utilities and impose sum-to-zero constraints to guarantee identifiability.
As for the FETA model , we will now see that it is identifiable up to the choice of .
Proposition 4.2.
Suppose and to be such that for any distinct there is some with . Let and be arbitrary. Then, we have if and only if
[TABLE]
Proof.
is clear. For proving the remaining implication , suppose that .
Claim 4.2.1.
For any distinct we have
[TABLE]
Proof.
For arbitrary with , and we have
[TABLE]
Since this holds for arbitrary (and thus also for ), Claim 4.2.1 follows.
Now, let be fixed for the moment and define via
[TABLE]
According to Claim 4.2.1 we have for any distinct the identity
[TABLE]
Moreover, the definition of assures that holds for any , i. e., already fulfills
[TABLE]
For we may choose a query set and then (12) assures us
[TABLE]
Since holds by definition of , we thus have shown
[TABLE]
With regard to (12) it remains to show
[TABLE]
For this, note that the same argumentation as before with replaced by some arbitrary shows us that also fulfills (12) with replaced by . In particular, (14) holds. Combining (12), (13) and (14) completes the proof.
Corollary 4.3.
Suppose and are as in Proposition 4.2 and let be fixed. Then, the mapping is injective.
Another interesting theoretical question concerns the expressiveness of the FETA decomposition: Which predictors can be represented by FETA? The following result shows that the decomposition into pairwise utilities (11) is indeed a restriction, in the sense that it does not allow for representing the entire class of predictors in case .
Proposition 4.4.
If , not every singleton choice function on can be expressed via the second order FETA model. More precisely: For distinct there do not exist sub-utility functions , and such that the choice function defined either via or via fulfills
[TABLE]
Proof.
We prove the statement indirectly. To this end, fix distinct , , , , , , and assume there were some and such that defined either via or via fulfills both (15) and (16). With the convenient abbreviations and , the following constraints for (11) immediately follow from (15):
[TABLE]
Summing up the first two inequalities and then applying the third one yields
[TABLE]
from which we obtain via subtracting common terms
[TABLE]
Exactly the same argumentation (with the roles of and interchanged and resp. replaced by resp. ) lets us infer from (16)
[TABLE]
which contradicts (17). This completes the proof.
Note that a limited expressivity should not necessarily be seen as a negative property. In particular, from a machine learning perspective, excessive expressivity (or capacity of the underlying hypothesis space) is connected with the practical problem of poor generalization due to overfitting, i. e., being overly expressive may prevent the learner from identifying the right model. In any case, we expect FETA to work well for all choice functions that (approximately) decompose into a pairwise relation between objects. Naturally, this leads to the question whether it is possible to incorporate more of the set-based context without ultimately increasing computational complexity. This question motivated our next decomposition.
4.2 First Aggregate Then Evaluate
To deal with the problem of task contexts of variable size, our previous approach was to decompose the context into sub-contexts of a fixed size, evaluate an object in each of the sub-contexts, and then aggregate these evaluations into an overall assessment. An alternative to this FETA strategy, and in a sense the opposite approach, consists of first aggregating the task into a representation of fixed size, and then evaluating the object in the presence of this task representative.
More specifically, the FATE approach requires a mapping from to some -dimensional embedding space as well as a context-dependent sub-utility function . To evaluate an object in a choice task , the FATE strategy first computes as representative for the task and then evaluates it via as
[TABLE]
We call this the FATE utility function with sub-utility function and transformation and denote it by .
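A minimal sketch of the FATE strategy follows. It assumes that the task representative is the mean of the embedded objects (the aggregation used later for FATE-Net). The embedding `phi` and the sub-utility `u_prime` are hypothetical examples, not the learned functions.

```python
import numpy as np

def fate_utilities(Q, phi, u_prime):
    """FATE utilities: aggregate the task first, then evaluate each object."""
    # Representative of the task: mean of the embedded objects (assumed form).
    mu = np.mean([phi(y) for y in Q], axis=0)
    return np.array([u_prime(x, mu) for x in Q])

# Hypothetical embedding and sub-utility, for illustration only.
phi = lambda y: np.concatenate([y, y ** 2])
u_prime = lambda x, mu: float(x @ mu[: len(x)] - mu[len(x):].sum())

Q_small = np.array([[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]])
Q_large = np.vstack([Q_small, [[0.3, 0.8], [0.7, 0.7]]])
s_small = fate_utilities(Q_small, phi, u_prime)
s_large = fate_utilities(Q_large, phi, u_prime)
```

The first three objects are identical in both tasks, yet their utilities differ because the representative changes once further objects are added. This is the context-dependence the model is meant to capture.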
This approach is related to recent advances on dealing with set-valued inputs in neural networks [126, 90, 8], where a permutation-equivariant network directly maps from sets of objects to scores. [95] propose to learn set-dependent aggregation functions with an inductive bias towards principles from behavioral choice theory. They note that general models like Deep Sets [126], which try to approximate set functions using a permutation-invariant neural network, are overly general, because they have a high violation capacity, i. e., the flexibility of the model to change its choices, when objects are removed from the choice task. The FATE approach on the other hand first condenses the task context into a representative and only then scores each object. The resulting model has an inductive bias that favors functions for which the object utility depends on such a set-global reference object. This could be advantageous for datasets where the set of objects as a whole can be summarized by suitable global properties (e. g., choosing that element from a set, which is closest to the centroid of all elements in the set), such that the task to score the objects with this context becomes easy. FETA on the other hand, incorporates task-information through local interactions.
Without further assumptions on and , this model is able to express any possible choice function on , as we show in the following. The proof of the upcoming result is similar to the proof of Theorem 2 by [126].
Proposition 4.6.
Suppose to be countable and . There exists a parametrization with the following property:
- (i)
For any singleton choice function on , there is a utility function such that holds for any .
- (ii)
For any subset choice function on there exists a utility function with for any .
Proof.
Since is countable, there exists an injective function . For define
[TABLE]
wherein denotes the -th prime number for any . Before proving (i) and (ii), we show that the mapping
[TABLE]
is injective. For this, let with . Then,
[TABLE]
holds for the integers and , i. e., . As and are both products of distinct primes, the uniqueness of the prime factorization lets us infer and thus also .
We proceed with proving (i) and (ii) simultaneously. For this, suppose any choice function on to be fixed. Since from above is injective, there exists a mapping such that holds for any . Note, that is the inverse function of . Thus, the claim follows with the choice
[TABLE]
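The injectivity argument rests on the uniqueness of the prime factorization, which can be checked on a small example. The sketch below is a simplification: it encodes a finite set of object indices directly as an integer product of distinct primes rather than through the real-valued embedding used in the proof. The helper names are hypothetical.

```python
from itertools import combinations

def first_primes(n):
    """Return the first n prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def encode(indices, primes):
    """Encode a finite set of object indices as a product of distinct primes."""
    code = 1
    for i in indices:
        code *= primes[i]
    return code

def decode(code, primes):
    """Recover the set from its code via the unique prime factorization."""
    return frozenset(i for i, p in enumerate(primes) if code % p == 0)

primes = first_primes(6)
# All non-empty subsets of a ground set with six objects.
subsets = [frozenset(c) for r in range(1, 7) for c in combinations(range(6), r)]
codes = [encode(s, primes) for s in subsets]
```

Since distinct subsets receive distinct codes and every code can be decoded, any function of the task can in principle be expressed as a function of its code.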
Although this expressivity is desirable in general, it comes at a cost. The FATE model as such is not identifiable: For example, suppose is of the form for some function , where denotes the standard euclidean norm in . For arbitrary , we obtain with that
[TABLE]
for any , i. e., holds.
4.3 Linear Sub-Utility Functions
A related question concerns the expressivity of the FATE and FETA approaches, when the underlying sub-utility functions and transformations are linear functions. In case and are chosen as linear functions in the sense that and for any and some , and , (18) takes the form
[TABLE]
As the second summand therein does not depend on , for any , the singleton choice is the same as that corresponding to the linear utility function and thus independent of the context . Consequently, at least one of and has to be non-linear in order to model context-dependent choices.
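The context-independence of a fully linear FATE model can be verified numerically. The sketch assumes an affine embedding and a sub-utility that is linear in both the object and the task representative. The weight names `A`, `b`, `w_x` and `w_mu` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 3, 4                                          # object and embedding dimensions
A, b = rng.normal(size=(k, d)), rng.normal(size=k)   # assumed linear embedding
w_x, w_mu = rng.normal(size=d), rng.normal(size=k)   # assumed linear sub-utility

def linear_fate_utilities(Q):
    """FATE utilities when both the embedding and the sub-utility are linear."""
    mu = (Q @ A.T + b).mean(axis=0)   # task representative
    return Q @ w_x + w_mu @ mu        # second summand is constant within Q

agreements = []
for _ in range(100):
    Q = rng.normal(size=(rng.integers(2, 10), d))
    # The chosen object coincides with that of the context-free utility w_x . x.
    agreements.append(np.argmax(linear_fate_utilities(Q)) == np.argmax(Q @ w_x))
```

In every sampled task the term depending on the representative shifts all utilities by the same amount, so the selected object is the one a context-independent linear model would select.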
In contrast to this, for the case of FETA, linearity of the sub-utility functions does not imply context-independence of the model: If and are linear in the sense that and for any distinct and some weight vectors , the FETA utility function with sub-utility functions is given as
[TABLE]
for any . As the second summand therein depends not only on but also on , can in general not be represented as a linear function.
5 Implementation Using Neural Networks
Having defined the decomposition strategies FETA and FATE in the preceding section, we are still missing an algorithm, which can actually learn the utility functions involved. In this section, we propose realizations of the FETA and FATE approaches in terms of neural network architectures FETA-Net and FATE-Net, respectively. Our design goals for both neural networks are twofold. First, they should be end-to-end trainable using (stochastic) gradient descent, such that they can be used as part of a larger neural network architecture. To this end, we ensure that the outputs of the networks are differentiable almost everywhere with respect to the weights. Similarly, the loss functions employed in conjunction with a regularization term for the weights should also be differentiable almost everywhere and convex with respect to the utilities. Second, the architectures should be able to generalize beyond the task sizes encountered in the training data, since in practice it is unreasonable to expect all choice tasks to be of the same size.
5.1 FETA-Net Architecture
We will now describe our first neural network architecture FETA-Net and its training. Recall from Section 4.1 that we seek to predict utility scores of the form (11) for every object . What we need to learn, therefore, is the functions and .
In FETA-Net, we do so by means of a deep neural network architecture (shown in Figure 3). The network is trained on a set of data , where each is a choice task and the choice set observed for that task.
The main component is the neural network tasked with learning the pairwise utility function (depicted in blue). It receives the feature vectors of two objects and and outputs a score for in the presence of object . To build up the complete matrix would require iterating over all pairs of objects in . This is why we choose to adopt the CmpNN approach by [93] for the pairwise scoring function, i. e., instead of one output neuron we utilize two and . Weight sharing ensures that and holds. For the diagonal, we evaluate a separate network , which learns a latent utility component for each object (corresponding to the case in (10)). With that it suffices to iterate over all combinations of objects once, and to construct the matrix as follows:
[TABLE]
Then, each row of the relation is averaged to obtain a score for each object . Therefore, the network is a mapping and a mapping which can be instantiated by any neural network architectures suitable for the given objects. For our experiments later on, we shall use deep, densely connected networks. We treat the number of layers and units as hyperparameters and optimize them jointly with all the other hyperparameters.
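The construction of the pairwise relation can be sketched as follows. This is an illustration under assumptions, not the architecture of Figure 3: a tiny network with random weights stands in for the pairwise network, the two output neurons are emulated by evaluating the same weights on both orderings of a pair, and the inherent utility is added to the mean of the off-diagonal row entries. All weight names are hypothetical.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(1)
d, h = 4, 8
W1 = rng.normal(size=(2 * d, h))   # hidden layer of the pairwise network
W2 = rng.normal(size=h)            # output layer of the pairwise network
w0 = rng.normal(size=d)            # linear stand-in for the inherent utility

def pair_scores(x, y):
    """Return (score of x given y, score of y given x) with shared weights."""
    def score(a, b):
        return float(np.tanh(np.concatenate([a, b]) @ W1) @ W2)
    return score(x, y), score(y, x)

def feta_net_scores(Q):
    """Utilities for all objects of a task Q of shape (n, d), n >= 2."""
    n = len(Q)
    R = np.zeros((n, n))
    for i, j in combinations(range(n), 2):   # visit each unordered pair once
        R[i, j], R[j, i] = pair_scores(Q[i], Q[j])
    # Mean over the n - 1 off-diagonal entries of each row, plus inherent utility.
    return Q @ w0 + R.sum(axis=1) / (n - 1)

Q = rng.normal(size=(5, d))
scores = feta_net_scores(Q)
```

Iterating over unordered pairs fills both the upper and the lower triangle in a single pass, which is the saving the two-output design aims at.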
The complete training algorithm for FETA-Net is shown in Algorithm 1, which is an instantiation of stochastic gradient descent. We will denote the weight vectors of the networks and by and , respectively. In the beginning, these weight vectors are suitably initialized in order to avoid exploding/vanishing gradients [42, 48]. In each epoch, the algorithm shuffles the given dataset and constructs mini-batches with for all . In lines 10 to 18, the pairwise relation is constructed as described above. The utilities for the objects inside the task are computed in line 19 by summing the pairwise relation across the columns of the matrix. Finally, the loss is computed in line 20 and added to the cumulative loss for the batch. The weight vectors and are updated using backpropagation in lines 22–23.
It is easy to see that the training runtime complexity per epoch (including backpropagation) of FETA-Net is , where denotes the number of instances, is the number of features per object, and is an upper bound on the number of objects in each choice task. For a new task , the prediction time is in .
5.2 FATE-Net Architecture
The second architecture we propose is called FATE-Net, and the structure for predicting the score for one object is depicted in Figure 4. Inputs are the objects of the task (shown in green). Each object is independently passed through a deep, densely connected embedding layer (shown in blue). The embedding layer approximates the function in (18) and is a map . Note that we employ weight sharing, i. e., the same embedding is used for each object. Then, the representative for the task is computed by averaging the representations of each object. To calculate the score for an object , the feature vector is concatenated with to form the input to the final joint neural network layers (here depicted in orange). Again, weight sharing is used to learn only one scoring network. For both neural networks, we treat the number of layers, units and embedding dimensions as hyperparameters, which are to be optimized.
The detailed training algorithm is shown in Algorithm 2. As mentioned before for FETA-Net, it is an instantiation of stochastic gradient descent. We will denote the weight vectors of the networks and by and , respectively. The initialization of the weight vectors and the construction of the mini-batches (lines 10–14) is again the same as for FETA-Net. In line 18, the representative object is constructed by first mapping each object to the embedding space using , and then computing the centroid of the embedded points. The embedding network can be any network that receives an object and returns a -dimensional real-valued vector, and should be adapted to the data at hand. The utility scores are then computed by evaluating each object in conjunction with the representative point (see line 19). The cumulative loss for the mini-batch is updated in line 20. The weight vectors and are updated by calculating the gradient of the loss using backpropagation and scaling it by an appropriate learning rate (lines 21–22).
The training runtime complexity per epoch of FATE-Net (including backpropagation) is , where denotes the number of choice tasks, is the number of features per object, and is an upper bound on the number of objects in each task. For a new choice task , the prediction can be done in only time (i. e., linear in the number of objects). This is due to the fact that only needs to be computed once.
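The forward pass described above can be sketched in a few lines of NumPy. The layer sizes, the random weights and the single hidden layer per network are assumptions made for brevity. The actual numbers of layers and units are hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, h = 4, 6, 16                                          # feature, embedding, hidden sizes
E1, E2 = rng.normal(size=(d, h)), rng.normal(size=(h, k))   # shared embedding network
S1, S2 = rng.normal(size=(d + k, h)), rng.normal(size=h)    # shared scoring network

def relu(z):
    return np.maximum(z, 0.0)

def fate_net_scores(Q):
    """Score all objects of a task Q of shape (n, d)."""
    embedded = relu(Q @ E1) @ E2                        # same embedding for every object
    mu = embedded.mean(axis=0)                          # representative, computed once
    joint = np.hstack([Q, np.tile(mu, (len(Q), 1))])    # concatenate each x with mu
    return relu(joint @ S1) @ S2

Q = rng.normal(size=(7, d))
scores = fate_net_scores(Q)
```

Each object passes through the embedding network once and through the scoring network once, so the cost grows linearly with the task size. Averaging makes the representative independent of the order of the objects, and the same weights apply to tasks of any size.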
6 Empirical Evaluation
The main goal of our empirical evaluation is to find out for which kind of problems FATE-Net and FETA-Net work well. Moreover, we wish to compare these approaches with existing methods for ranking and choice. In particular, the following questions will be addressed:
- •
Are the decompositions FATE and FETA suitable for learning context-dependent choice functions?
- •
How important is (i) the complexity/expressiveness of the underlying model class and (ii) its ability to model context-dependent choice functions, and how do these two factors interact? For example, are deep neural networks (i. e. FATE-Net and FETA-Net) really needed, or would a simpler (e. g. linear) model also suffice? Can the additional complexity/expressiveness compensate for the inability to model context-dependent choice functions?
- •
To what extent is our approach able to generalize over the task size? For example, is it possible to produce accurate predictions on tasks of a specific size, even if that size has never occurred in the training data?
For the first two questions, we evaluate the approaches on a variety of general choice and singleton choice problems. We also introduce the variant FETA-Linear, which learns the FETA decomposition using only linear functions, to ascertain whether it is able to account for some of the context-effects present in the data.
In addition, we evaluate the performance of different logit models used in economics: multinomial logit (MNL) [75], nested logit (NL) [123], generalized nested logit (GNL) [122] and mixed logit (ML) [110]. The first logit model is the MNL model (referred to as GenLinearModel for the subset choice task), which assumes that the choice between two objects does not depend on other objects in the set [67]. The NL and GNL belong to the generalized extreme value (GEV) class of models that learn correlations amongst the objects in the given set, which implicitly accounts for some of the context effects, but mainly the similarity effect [9, 113]. GEV models allocate the objects in the given task into different sets called nests and learn correlations between the objects inside each nest [122, 110]. These nests are disjoint in case of NL [123]. GNL is the most general model of this class, which allows the fractional allocation of each object in to each nest and it learns the correlation between them [122]. ML estimates the choice probability as a mixture of multiple logits [76, 125].
Another model which was proposed for solving the task of singleton choice is the PairwiseSVM, which makes use of induced pairwise preferences to fit a linear model [32, 68].
As a recent context-dependent baseline model, we implement the set-dependent aggregation (SDA) approach by [95]. We also implement the RankNet model as an additional context-independent baseline, which learns a non-linear utility for each object by converting the choices to pairwise preferences [107, 18]. Due to a lack of algorithms specifically designed for the subset choice problem, we employ the same thresholding of the utilities described in (4) that we use for our approaches. The threshold is tuned on a small validation set for all approaches, using the $F_1$-score as target loss (see Appendix C for details).
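Threshold tuning of this kind can be sketched as a simple search over candidate values. The sketch assumes that an object is predicted as chosen when its utility exceeds the threshold and that the validation criterion is the mean $F_1$ over tasks. The candidate grid and the toy data are hypothetical, and the actual procedure is the one described in Appendix C.

```python
def f1(chosen, predicted):
    """F1-measure between two sets of object indices."""
    if not chosen and not predicted:
        return 1.0   # convention assumed here for two empty sets
    return 2 * len(chosen & predicted) / (len(chosen) + len(predicted))

def tune_threshold(utilities, choices, candidates):
    """Pick the candidate threshold with the highest mean F1 on validation tasks."""
    def mean_f1(t):
        total = 0.0
        for u, c in zip(utilities, choices):
            predicted = {i for i, score in enumerate(u) if score > t}
            total += f1(c, predicted)
        return total / len(utilities)
    return max(candidates, key=mean_f1)

# Toy validation data: predicted utilities per task and observed choice sets.
val_utilities = [[0.9, 0.2, 0.7], [0.1, 0.8, 0.6, 0.3]]
val_choices = [{0, 2}, {1, 2}]
best = tune_threshold(val_utilities, val_choices, [0.0, 0.25, 0.5, 0.75])
```

Because the same rule is applied to the utilities of every model, the comparison between approaches is not affected by how each model scales its scores.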
All in all, we compare to both deep neural networks and linear models, so that we have baselines of varying representative power, which helps to contextualize the performance of our approaches on each dataset. Finally, to answer the third question, we train the different models on a fixed task size and predict on queries of deviating size.
6.1 Setup
All experiments are implemented in Python, and the code and the dataset generators are publicly available (https://github.com/kiudee/cs-ranking). To compare all models in a fair and unbiased way, we optimize the hyperparameters of each model by employing Bayesian optimization in a nested validation loop (we use the Gaussian process based implementation in scikit-optimize [49]). The final out-of-sample estimates are then computed using another outer cross-validation loop with the best hyperparameters found in each fold. The loss functions and the datasets considered throughout our empirical evaluation are introduced in the following two subsections, respectively (see Appendix C for more details).
The experiments were run on a compute cluster with a mix of NVIDIA GTX 1080 Ti and RTX 2080 Ti GPUs (on average 15-20) and Intel Xeon E5-2670 processors. One job consisting of one outer split with complete hyperparameter optimization on the validation set took on average hours. The training of FATE-Net and FETA-Net on average (across datasets) required hours. Combined, all experiments took roughly GPU hours and CPU hours.
6.2 Loss Functions
As explained in Section 4, our goal during learning is to minimize a suitable target loss . This is usually the loss one is interested in minimizing, e. g., the loss derived from the $F_1$-measure in our case. Since these losses are usually not differentiable, they cannot readily be used in a gradient descent algorithm. Therefore, during training we opt to minimize surrogate losses, which are differentiable almost everywhere, instead. In this section, we will first introduce the target losses we consider (cf. Section 6.2.1). We then derive surrogate losses based on the probabilistic choice models introduced in Section 3 and based on practical considerations (cf. Section 6.2.2).
6.2.1 Target Loss Functions
The canonical loss function, which we focus on in the singleton choice setting, is the categorical 0/1-loss
[TABLE]
i. e., in case the ground-truth choice is , each false prediction is penalized with a loss of 1. In addition, we will call the quantity the categorical accuracy. Moving from singleton to subset choice, where and can now be choice sets of arbitrary size, the same loss function (20) can still be used. To signify that it is used in subset choice, we will call it the subset 0/1-loss. Targeting the subset 0/1-loss is problematic, especially whenever a task contains many objects, since a single incorrectly predicted object already results in the whole prediction being declared incorrect. One could instead opt to consider the average of the item-wise 0/1-loss, which is called the Hamming loss in the setting of multi-label classification [54]. However, this loss exhibits some properties that could be questioned in the context of choice. In particular, the non-prediction of a selected item (false negative) is penalized in the same way as the prediction of a non-selected item (false positive), although positives and negatives might be highly imbalanced.
A more suitable measure, which is widely used in classification, is the $F_1$-measure defined as
[TABLE]
for any . This measure takes values in $[0,1]$, and large values indicate conformity between and , whence an appropriate loss can be defined as follows (later on, we will nevertheless report the $F_1$-measure itself, which is common practice in machine learning):
[TABLE]
In spite of the existence of other measures that specifically aim at correctly predicting positives, such as the informedness [88, 87], we will mostly focus on as the target loss, because it is well known and commonly used as a performance metric. That means that we will use it as the validation loss for the Bayesian hyperparameter optimization we run for every learner. Additional evaluation measures we report are described in Appendix B.
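The difference between the three target measures can be made concrete on a small, imbalanced task. The sketch assumes the usual set-based form of the $F_1$-measure, i. e., twice the size of the intersection divided by the sum of the sizes of both sets. The value assigned to two empty sets is a convention chosen here, and the toy predictions are hypothetical.

```python
def subset_zero_one_loss(chosen, predicted):
    """1 if the predicted set differs from the chosen set in any object."""
    return float(chosen != predicted)

def hamming_loss(chosen, predicted, task_size):
    """Fraction of objects whose membership is predicted incorrectly."""
    return len(chosen ^ predicted) / task_size

def f1_measure(chosen, predicted):
    if not chosen and not predicted:
        return 1.0   # convention assumed here for two empty sets
    return 2 * len(chosen & predicted) / (len(chosen) + len(predicted))

# A task with 10 objects (indices 0..9) of which two were chosen.
chosen = {0, 1}
nothing = set()          # predicts no object at all
one_extra = {0, 1, 2}    # both chosen objects plus one false positive
```

Predicting nothing attains a Hamming loss of only 0.2 although no chosen object is found, whereas its $F_1$ value is 0. The prediction with one false positive is counted as entirely wrong by the subset 0/1-loss, yet it reaches an $F_1$ value of 0.8.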
6.2.2 Surrogate Losses
The probabilistic setting for choice that we introduced in Section 3 suggests a natural approach to learning and prediction:
- •
First, a learner is trained using the negative log-likelihood of the probabilistic model as a loss function. This loss function is not only differentiable, but also calibrated in the sense of being minimized by the true (conditional) probabilities. In other words, a learner trained with this loss is supposed to predict (unbiased) probabilities on the choice space (conditioned on the query).
- •
Thus, given a query for which a prediction is sought, a probability distribution on the choice space can be obtained as a prediction, which in turn allows for minimizing any target loss in expectation.
More specifically, let denote the latent utility scores , , predicted by a learner on a query . In a singleton choice scenario, where the data is supposed to be generated according to choice probabilities of the form (5) for some unknown ground-truth , one may define the corresponding categorical cross-entropy loss gained when observing
[TABLE]
This expression is minimized in case .
If dealing with subset choice data that is presumably sampled according to the choice probability distribution from (6), it is natural to measure prediction by means of the corresponding binary cross-entropy loss
[TABLE]
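Both cross-entropy losses can be written compactly in terms of the utility scores. The sketch assumes that (5) is the multinomial logit (softmax) model and that (6) treats each object as an independent logistic decision. Whether the per-object terms are summed or averaged is an assumption as well.

```python
import numpy as np

def categorical_cross_entropy(utilities, chosen_index):
    """Negative log-probability of the chosen object under a softmax model."""
    u = np.asarray(utilities, dtype=float)
    shifted = u - u.max()   # shifting all utilities leaves the loss unchanged
    return float(np.log(np.exp(shifted).sum()) - shifted[chosen_index])

def binary_cross_entropy(utilities, chosen_mask):
    """Sum of per-object logistic losses for a subset choice."""
    u = np.asarray(utilities, dtype=float)
    y = np.asarray(chosen_mask, dtype=float)
    # log(1 + exp(-u)) for chosen objects, log(1 + exp(u)) for all others.
    return float(np.sum(y * np.logaddexp(0.0, -u) + (1 - y) * np.logaddexp(0.0, u)))

loss_singleton = categorical_cross_entropy([2.0, 0.5, -1.0], chosen_index=0)
loss_subset = binary_cross_entropy([2.0, 0.5, -1.0], chosen_mask=[1, 1, 0])
```

Subtracting the maximum utility before exponentiating, and using `np.logaddexp`, avoids overflow for large scores without changing the value of either loss.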
In spite of the theoretical justification of the logistic losses discussed above, we found that “hinge-variants” of the respective 0/1-losses may sometimes lead to more stable results. More specifically, for the singleton choice setting, the categorical hinge loss defined via
[TABLE]
for any , is inspired by the hinge loss used in multi-class classification [29, 78] and can be used instead of (22).
Finally, for training FATE-Net and FETA-Net in the experiments below, we use the binary cross-entropy loss for the subset choice setting and the categorical hinge loss for the singleton choice setting, since these turned out to work well in preliminary experiments. In addition, an -regularization term for the magnitude of the weights is added and optimized as part of the loss during training.
Convexity of the Surrogate Losses
An important consideration for the surrogate losses to be used during training is whether they are convex with respect to the utility scores . All three losses introduced above are indeed convex. To see this for , notice that (22) can equivalently be written as $\log\bigl(\sum_{\boldsymbol{y}\in Q}\exp(U(\boldsymbol{y},Q)-U(\boldsymbol{x},Q))\bigr)$. The inner difference of utilities is linear and therefore convex. The outer function is also known as LogSumExp and is defined via $(z_1,\dots,z_n)\mapsto\log\bigl(\sum_{i=1}^{n}\exp(z_i)\bigr)$. It is convex and since it is also strictly increasing in each argument, the composition (22) is convex as well.
As for the binary cross-entropy , note that the inner function , of (23) is smooth with strictly positive first and second derivatives and hence convex and non-decreasing. Similarly, is convex and strictly decreasing on . Hence, we can conclude that (23) is convex.
Finally, the categorical hinge (24) contains the function , $\boldsymbol{x}\mapsto\log\bigl(\sum_{j\in[m]}\exp(x_{j}-x_{i})\bigr)$, which is convex as the composition of the convex LogSumExp function with a linear map. Since , is convex and non-decreasing, (24) is convex as well.
The FETA model further decomposes into an aggregation of sub-utility functions and . It is therefore interesting to ask whether the surrogate losses are also convex with respect to the sub-utility values , . We can answer this question in the affirmative, since the FETA utility values are positively weighted sums of these sub-utility scores.
However, the overall learning problem depends on the parameter of the realization of and and the corresponding loss function can possibly still be non-convex w.r.t. (as this is the case with the neural networks employed here). That means in practice we lose the guarantee of stochastic gradient descent to find a global optimum, but with careful tuning of the optimization process one can still expect to find reasonable solutions.
6.3 Datasets
We now introduce the learning problems used for the empirical comparison as follows:
- (a)
The Medoid problem, where the task is to predict the medoid of a set of points in a Euclidean space.
- (b)
The Pareto-front problem, in which the learner has to predict the set of points which are Pareto-optimal.
- (c)
The Hypervolume singleton choice problem, where the task is to select the point of the Pareto-front which contributes the most to the hypervolume.
- (d)
Different choice problems defined on the well-known MNIST dataset.
- (e)
Similarity/dissimilarity-based movie selection using the MovieLens Tag Genome dataset [118].
- (f)
The LEarning TO Rank (LETOR) MQ and MQ datasets [89] consisting of query-document pairs, with the goal to select the relevant documents.
- (g)
The Expedia hotel dataset featuring search results and relevance labels for each hotel with the goal to select booked/considered hotels [33].
- (h)
The Sushi dataset, where the task is to choose the most preferred sushi from a set of options provided to a user.
See Table 1 for an overview of the datasets and their properties. In the following sections, we will describe the different datasets, their motivation, and if applicable, how they are generated.
6.3.1 The Medoid Problem
The motivation for this problem is the general idea of learning to choose a most representative element from a set. More concretely, the medoid of a set is the object with the smallest cumulative dissimilarity to all other objects of the set (as opposed to the centroid, which is usually not part of the original set). It is commonly used as a representative element, especially for structured objects such as graphs, -D trajectories, images, etc. [116, 127].
Formally, we are interested in learning the choice function given as
[TABLE]
where we write here and throughout the remainder of this paper for the standard euclidean norm defined as . The singleton choice produced by this procedure incorporates all pairwise distances among the objects, which makes it a good context-dependent learning problem to investigate. In particular, is sensitive to changes of the elements in the task. With and we clearly have
[TABLE]
and thus is able to exactly model .
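The exact representation of the medoid choice by a second-order FETA model can be checked numerically. The sketch assumes a vanishing inherent utility and a pairwise score equal to the negative Euclidean distance, which is the natural instantiation. The function names are hypothetical.

```python
import numpy as np

def medoid_index(Q):
    """Index of the object with the smallest summed distance to all others."""
    distances = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)
    return int(np.argmin(distances.sum(axis=1)))

def feta_medoid_index(Q):
    """Same choice via FETA with zero inherent utility and U_1(x, y) = -||x - y||."""
    n = len(Q)
    scores = np.empty(n)
    for i in range(n):
        others = np.delete(Q, i, axis=0)
        # Average pairwise score of object i against all other objects.
        scores[i] = -np.linalg.norm(others - Q[i], axis=1).mean()
    return int(np.argmax(scores))

rng = np.random.default_rng(3)
Q = rng.uniform(size=(8, 3))
same_choice = medoid_index(Q) == feta_medoid_index(Q)
```

Dividing the summed distances by the number of other objects does not change their ordering, which is why the averaged pairwise scores select exactly the medoid.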
In contrast to this, for the FATE approach, it is not immediately obvious if and how it is capable of modelling exactly. However, the choices , and yield
[TABLE]
with being the centroid of . Thus, the item , which is closest to , i. e., , is likely to coincide with the medoid of . As we construct our synthetic medoid dataset by sampling according to the uniform distribution on for some predefined , there is with a FATE-instance, which is expected to have (for the case of singleton choice) an accuracy of at least
[TABLE]
on the synthetic medoid dataset. An empirical evaluation revealed that this value is for and . For the details on this dataset, confer Section E.1.
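How often the point closest to the centroid coincides with the medoid can be estimated by simulation. The sketch below is not the evaluation reported above: the number of tasks, the task size and the dimension are hypothetical values chosen only to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(0)

def medoid_index(Q):
    """Index of the object with the smallest summed distance to all others."""
    distances = np.linalg.norm(Q[:, None, :] - Q[None, :, :], axis=-1)
    return int(np.argmin(distances.sum(axis=1)))

def nearest_to_centroid_index(Q):
    """Choice of the FATE instance: the object closest to the centroid of Q."""
    return int(np.argmin(np.linalg.norm(Q - Q.mean(axis=0), axis=1)))

def agreement_rate(n_tasks, n_objects, n_features):
    """Fraction of uniformly sampled tasks on which both choices coincide."""
    hits = 0
    for _ in range(n_tasks):
        Q = rng.uniform(size=(n_objects, n_features))
        hits += medoid_index(Q) == nearest_to_centroid_index(Q)
    return hits / n_tasks

rate = agreement_rate(n_tasks=500, n_objects=10, n_features=5)
```

The resulting rate is a Monte Carlo estimate of the accuracy this particular FATE instance attains under the stated sampling assumptions.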
6.3.2 The Pareto-Front Problem
The computation of a Pareto-optimal set of points is an important problem in optimization and various fields of application [41]. We say is dominated by (short: ) if holds for any and for at least one . For any set we define the Pareto-set or Pareto-front of as
[TABLE]
We wish to investigate the possibility to learn the mapping from sets of points to their respective Pareto-sets. It is clear that the size of the Pareto-sets is not constant, which makes it a good candidate for a general subset choice problem. With the choices and we have
[TABLE]
Hence, holds trivially for each , i. e., the Pareto problem is exactly solvable via the FETA approach. We created our corresponding synthetic dataset by generating a set of points uniformly at random in and to construct a choice task , and the ground-truth is the Pareto-set of containing only the non-dominated objects. In order to perform the experiments, we generate sets of random points in and , and determine the choices as described in detail in Section E.2.
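The Pareto-set and a FETA-style way of recovering it can be sketched as follows. The direction of domination is an assumption: larger coordinates are treated as better here, and the inequalities have to be flipped for the opposite convention. The pairwise score of -1 for a dominated object and the threshold at zero are likewise assumed for illustration.

```python
import numpy as np

def dominates(y, x):
    """True if y dominates x (assumed convention: larger is better)."""
    return bool(np.all(y >= x) and np.any(y > x))

def pareto_set(Q):
    """Indices of the non-dominated objects of Q."""
    return {i for i, x in enumerate(Q)
            if not any(dominates(y, x) for j, y in enumerate(Q) if j != i)}

def feta_pareto_set(Q):
    """FETA-style variant: U_1(x, y) = -1 if y dominates x, else 0, and U_0 = 0."""
    n = len(Q)
    utilities = [np.mean([-float(dominates(Q[j], Q[i])) for j in range(n) if j != i])
                 for i in range(n)]
    # An object is chosen iff no other object contributed a penalty.
    return {i for i, u in enumerate(utilities) if u >= 0.0}

Q = np.array([[1.0, 4.0], [2.0, 3.0], [1.0, 1.0], [3.0, 1.0], [2.0, 2.0]])
front = pareto_set(Q)
front_feta = feta_pareto_set(Q)
```

An object receives a negative utility as soon as a single other object dominates it, so thresholding the averaged pairwise scores at zero reproduces the Pareto-set.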
6.3.3 Hypervolume
A related but much harder problem is the computation of hypervolume contributions of objects on a Pareto front. The hypervolume of a subset describes the volume of the union of the subspaces dominated by each individual point in the Pareto set of and can formally be defined as
[TABLE]
where denotes the Lebesgue measure of . In the context of multi-objective evolutionary algorithms, one usually computes the contributions of each point to the overall hypervolume , i. e., the reduction in hypervolume caused by removing one object from the set. We consider the problem of learning the corresponding Hypervolume choice function , which picks that element with the largest contribution to the overall hypervolume, i. e.,
[TABLE]
As shown by [17, Theorem 1], it is #P-hard to calculate . Here, we generate sets of random points in and determine the singleton choice.
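For two objectives, the hypervolume and the contributions of the individual points can be computed with a simple sweep. The sketch is restricted to the two-dimensional case and does not address the general problem, which is hard. It assumes that larger coordinates are better and that the reference point is the origin. Both are conventions chosen for illustration.

```python
import numpy as np

def hypervolume_2d(points, reference=(0.0, 0.0)):
    """Area dominated by 2-D points w.r.t. a reference point (larger is better)."""
    pts = sorted((tuple(p) for p in points), key=lambda p: -p[0])   # by x, descending
    area, best_y = 0.0, reference[1]
    for x, y in pts:
        if y > best_y:
            # New horizontal strip between the previous best y and this y.
            area += (x - reference[0]) * (y - best_y)
            best_y = y
    return area

def contributions(points):
    """Reduction in hypervolume caused by removing each point in turn."""
    total = hypervolume_2d(points)
    return [total - hypervolume_2d(np.delete(points, i, axis=0))
            for i in range(len(points))]

points = np.array([[1.0, 4.0], [2.0, 2.0], [4.0, 1.0], [1.0, 1.0]])
total = hypervolume_2d(points)
contrib = contributions(points)
```

The dominated point (1, 1) has a contribution of zero, since removing it leaves the dominated area unchanged. The choice function then selects an object based on these contribution values.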
6.3.4 MNIST Number Problems
The original goal of the Modified National Institute of Standards and Technology (MNIST) dataset was to facilitate the comparison between different handwritten digit classifiers [65]. It consists of grayscale images. We use the dataset to create challenging choice problems, both singleton and general subset choice. To level the playing field between all the approaches, we first train a convolutional neural network (CNN) on instances and use it to extract high level features for the remaining images (see Section E.3 for more details). To convert this dataset to a choice problem, we randomly sample sets of numbers and choose based on the following procedures:
1. Mode: For the Mode dataset, we choose the numbers that occur most often in the choice task . For example, given a set of numbers , , , , , , , , , we choose all instances with value equal to the mode value . For the singleton choice task, we only output one of the numbers (the representation of which has the least angle to a predefined vector).
2. Unique: Here, we choose all numbers that occur only once in the set of sampled label values. For example, given a set of numbers , , , , , , , , , , we choose the numbers . For the singleton choice problem, we ensure that exactly one of the digits is unique.
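The two selection rules can be sketched directly on the label vector of a choice task. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the functions return object indices, and ties between several modal values are resolved by selecting all of them.

```python
from collections import Counter


def unique_choice(labels):
    """Indices of objects whose label occurs exactly once in the task."""
    counts = Counter(labels)
    return [i for i, y in enumerate(labels) if counts[y] == 1]


def mode_choice(labels):
    """Indices of objects whose label is (one of) the most frequent in the task."""
    counts = Counter(labels)
    top = max(counts.values())
    return [i for i, y in enumerate(labels) if counts[y] == top]


labels = [3, 1, 3, 7, 1, 3, 5]
chosen_unique = unique_choice(labels)  # positions of the labels 7 and 5
chosen_mode = mode_choice(labels)      # positions of the label 3
```

Both rules inspect the whole label vector, so adding or removing a single object can change which of the remaining objects are chosen.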
6.3.5 MovieLens Tag Genome
The MovieLens Tag Genome dataset consists of a large collection of movies and community-curated tags [118]. For each movie, the relevance of every tag is provided on a continuous scale in . Thus, the complete relevance vector of a movie can be regarded as that movie’s “genome.”
We consider the problem of choosing the most similar/dissimilar movie from a set of movies, where one movie is regarded as the reference to which the others are compared. We define this reference movie to be the medoid of the movies in a given set. To compute similarities in tag relevance space, we use the weighted cosine similarity as proposed by [117].
6.3.6 LETOR
LETOR is a collection of benchmark datasets for different learning-to-rank problems [89]. The Gov2 web page collection, consisting of roughly 25 M pages, is the corpus and the query sets of the Million Query track of the TREC and [111, 112] are used to create datasets. Each query-document pair is defined by a vector consisting of features. We use the supervised ranking datasets MQ and MQ to create the choice dataset. We treat all documents with a relevance score of 1 and 2 as the chosen objects. Since all queries include multiple documents with relevance scores and , we cannot extract singleton choices from this dataset. The listwise ranking datasets MQ-list and MQ-list contain real-valued scores of the documents in the underlying permutations, and hence facilitate the singleton choice for each query (details of the exact procedure can be found in Section F.1).
6.3.7 Expedia
The Expedia dataset was released on the Kaggle website as a competition in 2016 [33]. It consists of lists of hotels, each resulting from a search query of a user. For each hotel, there are features and a relevance score, indicating how relevant the hotel is to the provided query. A score of [math] means that it was not relevant, a score of indicates that the user clicked on it, and a implies that the hotel was booked. It is straightforward to construct choice datasets: for singleton choice, the goal is simply to predict the booked hotel, whereas for subset choice, we require the learners to output the complete set of hotels that were at least clicked on (see Section F.2 for more details).
6.3.8 SUSHI
SUSHI (available for download at http://www.kamishima.net/sushi/) is a dataset created by [57] specifically for the task of object ranking. The authors considered sushis and asked users to rank them according to their preference. The dataset consists of two sets of rankings. Each ranking consists of sushis, which were ranked by users in a survey. For the first set, the authors asked the users to rank the top-10 most popular sushis. In the second set, users were shown random sets of sushis instead. Each sushi is described by object features. Additional user features are available, but not used in our experiments. For our experiments, we merge both datasets into a single one containing instances. We use it as a singleton choice dataset by choosing the most preferred sushi as the singleton choice for the given task set (details of the exact procedure can be found in Section F.3).
6.4 Results and Discussion
In this section, we provide the results obtained by evaluating different subset choice and singleton choice models on the datasets. To be concise, we only show plots for the target losses here and list the complete set of results in Tables 9, 10 and 11 in Appendix G. It is illuminating to compare the performance of FATE-Net and FETA-Net to the context-independent neural network RankNet. This provides a rough indicator for how important being able to model context-dependence is.
6.4.1 Singleton Choice
We will start by discussing the results for the singleton choice models (cf. Figure 5), where the bars depict the mean value of the categorical accuracy (26) across the cross-validation folds, with black lines depicting the standard deviation.
The first observation is that FATE-Net and FETA-Net significantly outperform all other baselines on the tasks for which it was clear that the underlying choice function is context-dependent (i. e., Hypervolume, Medoid and the MNIST datasets). The SDA network, which is also a context-dependent model, achieves competitive results on the Medoid and the MNIST datasets. The linear FETA variant FETA-Linear and the non-linear neural network RankNet perform comparably to the other baseline approaches. This suggests that a combination of non-linearity and the ability to model context-dependence is really necessary to improve on these tasks. One notable exception is the Medoid dataset, for which RankNet and FETA-Linear manage to outperform the other baselines by a large margin.
For the MNIST-Unique problem, FATE-Net and FETA-Net achieve an accuracy of more than and SDA is competitive with over . Additionally, the GNL and ML models are also able to perform better than the other baselines. It is easy to see that the dataset exhibits the similarity context effect proposed by [52], i. e., adding multiple instances of the same digit to the choice task reduces the choice probability of all equal digits to 0. As is apparent, the GNL and ML models are able to account for it and score better than chance.
Since FATE-Net, FETA-Net and SDA were able to achieve close to accuracy on the MNIST-Unique problem, we performed an additional experiment where we generated instances completely synthetically. We represent each number by the corresponding standard unit vector , which is in the -th position and is [math] everywhere else. Apart from that, the task remains the same. We calibrated each network to have roughly the same number of parameters ( for FATE-Net, for FETA-Net and for SDA), and the remaining hyperparameters were equal for all networks. We then trained them until convergence on a stream of newly generated batches of 1024 instances, each containing 10 objects. The resulting convergence behavior is shown in Figure 6. Both FETA-Net and SDA are able to converge to out-of-sample categorical accuracy within 100 epochs, while FATE-Net only achieves slightly over , and additional epochs alone did not enable it to learn the target function without error. We therefore repeated the experiment for FATE-Net with a higher epoch and parameter budget. With parameters, FATE-Net is able to perfectly learn the fully synthetic unique problem within 400 epochs. On the one hand, this shows that from a representational perspective, all three models are able to learn this particular target choice function perfectly. On the other hand, FATE-Net appears to be less parameter- and data-efficient, which could indicate that evaluating the utilities in the context of the set embedding is not well suited to represent these kinds of problems. The behavior of all three networks was consistent across repetitions of the experiment.
On the real-world datasets (i. e., Sushi, MovieLens Tag Genome, LETOR and Expedia), the performance of FATE-Net and FETA-Net is closer to that of the remaining baselines. Although they still obtain slightly higher accuracy on average, the margin is not as pronounced. Surprisingly, SDA achieved the worst accuracy on LETOR and Expedia. We suspect that this results from the models being trained only on a fixed choice task size in our experiments, while they are evaluated on choice tasks of varying size at test time. Since SDA learns a set-dependent aggregation function, it could be that this does not generalize well to the larger choice tasks present in the real-world datasets.
6.4.2 Subset Choice
We evaluate the subset choice models in terms of their -measure (21) and report the results in Figure 7. To see if the models are able to learn anything, we also show the performance of the baseline that always predicts positive.
The general pattern is confirmed: FATE-Net, FETA-Net and SDA surpass the other baselines on the datasets Pareto-front 2D, MNIST Mode, and MNIST Unique, while being competitive for the real-world datasets LETOR and Expedia. For the MNIST tasks Unique and Mode, the first observation is that all linear and/or context-independent baseline approaches fail to learn anything on these datasets, since they all achieve the same -measure as the all-positive baseline. Thus, it is clear that these tasks can only be solved by models that are both context-dependent and non-linear.
For the Pareto problem, it can be observed that the context-dependent models FETA-Net, FATE-Net, and SDA outperform all benchmark choice models on the 2D version. On the 5D version of the dataset, however, the performance of all approaches reaches a comparable level. This indicates that solving the task of selecting the Pareto-front becomes less context-dependent in higher dimensions, since the distance of a point from the center becomes more and more informative. At the same time, more points are on the Pareto-front overall, which is apparent from the high -measure of the AllPositive baseline.
As before, the results are more homogeneous on the real-world datasets Expedia and LETOR MQ/MQ. FATE-Net and FETA-Net are still outperforming all the benchmarks. This suggests that the ability to model context-dependence in the data is slightly more important for these datasets than learning a non-linear utility function. SDA achieves the best result on the Expedia dataset, which when compared to the bad performance on the singleton choice variant of the dataset suggests that the thresholding of the utilities is robust to the model output changing with varying choice task sizes.
Overall, the results demonstrate that FATE-Net and FETA-Net are able to improve on the context-independent baselines by a large margin on tasks which are strongly context-dependent and show competitive results when compared to SDA. The improvement is due to both the task-sensitivity of these models and the ability to model non-linear utility functions. For the real-world datasets, the improvements are smaller, suggesting that context-effects are either less pronounced or that the context-effects in real-world data cannot fully be captured yet.
6.4.3 Generalization Across Task Sizes
We conduct additional experiments to gauge the generalization capability of the learned models to unseen task sizes (refer to Appendix D for more details). We show the results for the datasets Medoid and Hypervolume, because, as will be seen, they exhibit some interesting properties. We specifically compare the performance on the singleton choice datasets (Figure 8).
We train the models on a fixed task size and then test them on sets containing between and objects. Note that for singleton choice, the accuracy is not comparable across differing task sizes. We instead report the normalized accuracy (see Section B.1), which fixes this issue and guarantees that random guessing achieves exactly 0.
Overall, the models manage to generalize quite well to task sizes for which they were not trained. The exact generalization behavior depends on the dataset, though. Considering the Medoid dataset, we can observe that the models FETA-Net, FETA-Linear and RankNet even improve in performance with larger task sizes. This is plausible, since the more points fill the space, the more the problem can be solved by a context-independent model, which assigns the highest score to objects in the center. For the singleton choice version of Hypervolume, on the other hand, the performance of all models drops with an increasing number of objects, suggesting it becomes much harder to identify the object that contributes the most to the overall hypervolume. This is especially visible for the baselines, which, even though they were trained on 10 objects, achieve their best performance on 3 objects. FETA-Net, FATE-Net, and FETA-Linear stand out here, since their performance decays much more slowly. All in all, we conclude that our networks FETA-Net and FATE-Net are able to generalize very well to unseen task sizes, with FETA-Net additionally benefiting if the task becomes less context-dependent with larger task sizes.
7 Conclusion and Future Work
In this paper, we tackle the problem of choice from a machine learning perspective. More specifically, we propose a framework for learning context-dependent choice functions, which, on the basis of choice behavior observed in the past, allow for predicting the choice of objects in new situations. This is essentially accomplished by learning generalized (latent) scoring (utility) functions, which are supposed to control the choice behavior.
Violations of context-independence are common in human choice behavior. Therefore, accounting for the various context effects they can exhibit can be seen as an important problem. Still, we consider the space of interesting non-trivial choice functions to be vastly larger, and the goal is to have general purpose models that can adapt to a wide variety of (yet unknown) context effects.
To this end, we propose two principled decompositions: The FETA decomposition is a first-order approximation to a more general utility decomposition. It considers each object in local sub-contexts, the contributions of which are averaged. The FATE approach, on the other hand, first transfers each object into an embedding space and computes a representative of the choice task by averaging these embedded points. The utility of each object is then evaluated with the representative as global context. Both approaches are complementary and have differing inductive biases. In spite of this, both show promising predictive performance.
While the FETA and FATE decompositions are general and in a sense quite natural approaches to model context-dependent choice functions, a promising direction is the investigation of application-specific models with more focused inductive biases. An example is the SDA approach, which applies principles from behavioral choice theory and also tries to take the risk-aversion of humans into account [95].
While the most influential context effects for human choices have been studied, gaining a deeper understanding of the rich mathematical structure of general choice problems is an important future endeavor.
Acknowledgements
The authors gratefully acknowledge the financial support provided by the European Regional Development Fund (ERDF) and the valuable feedback provided by the industry partners of the Smart-GM research project – EFRE-0801915.
Funded by the Deutsche Forschungsgemeinschaft (DFG – German Research Foundation) – 317046553.
This work is part of the Collaborative Research Center “On-the-Fly Computing” at Paderborn University, which is supported by the German Research Foundation (DFG). Experiments were performed on resources provided by the Paderborn Center for Parallel Computing.
Appendix A Notation
Appendix B Evaluation Measures
Besides the target losses introduced in Section 6.2, we evaluate the trained models using additional evaluation measures. These should give a more complete picture of the performance of the different models. The results including the additional measures can be found in Appendix G.
B.1 Singleton Choice
To define the evaluation measures in the singleton choice setting, suppose in the following a choice task space , a utility function for as well as and to be arbitrary but fixed.
Top- Categorical Accuracy
The top- categorical accuracy is defined as the fraction of times in which the set of objects in the top positions, according to the predicted scores, contains the ground-truth chosen object [23, 9]. Formally, writing with , we have
[TABLE]
Categorical Accuracy
The categorical accuracy is defined as the fraction of times in which the object with the largest score is the same as the ground-truth singleton choice, i. e.,
[TABLE]
The categorical accuracy is the most common measure used for the evaluation of SCMs and commonly referred to as hit-rate [9]. It is evident that $\operatorname{m}_{\text{CA}}(U,Q,\{\boldsymbol{x}\})=\operatorname{m}_{\text{top-1}}(U,Q,\{\boldsymbol{x}\})$ holds, provided is a singleton set.
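As an illustration of the two measures, the sketch below computes the top-k categorical accuracy from predicted scores and the index of the ground-truth choice; the categorical accuracy is the special case k = 1. The function names are illustrative, and ties between equal scores are broken by object position, which is an assumption rather than part of the definition above.

```python
def top_k_hit(scores, chosen, k=1):
    """True if the ground-truth object is among the k highest-scored objects."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return chosen in ranked[:k]


def top_k_accuracy(all_scores, all_chosen, k=1):
    """Fraction of choice tasks in which the ground-truth object is in the top k."""
    hits = [top_k_hit(s, c, k) for s, c in zip(all_scores, all_chosen)]
    return sum(hits) / len(hits)


scores = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.6]]
chosen = [2, 0]
categorical_accuracy = top_k_accuracy(scores, chosen, k=1)
top_2_accuracy = top_k_accuracy(scores, chosen, k=2)
```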
Normalized Accuracy
The measures defined above are not a reasonable estimate when observing the performance of an SCM on the choice tasks of different sizes , since the task becomes harder as the choice task size increases. The hardness of the task should be adjusted with respect to the accuracy that random guessing can achieve, which is defined as the probability of choosing the correct singleton choice from the choice task . Assuming each object to be chosen with the same probability, the probability for choosing a fixed object is . These considerations motivate the definition of the normalized accuracy as follows:
[TABLE]
Note that this measure takes values in . The minimum value of is achieved when the algorithm performs with an accuracy of [math], i. e., it is worse than random guessing, and the maximum value of when the learner always predicts correctly. A value of [math] indicates that the learner performs similarly to random guessing. This measure was derived using the “correction for guessing” formulation [28].
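A minimal sketch of such a correction is given below. It assumes the standard "correction for guessing" form, in which the chance level is subtracted and the result is rescaled so that a perfect learner reaches 1; this form is consistent with the properties described above, but it is an assumption and may differ in detail from the exact definition used in the paper.

```python
def normalized_accuracy(accuracy, task_size):
    """Rescale accuracy so that uniform random guessing maps to 0 and perfection to 1."""
    chance = 1.0 / task_size  # probability of guessing the single correct object
    return (accuracy - chance) / (1.0 - chance)


at_chance = normalized_accuracy(0.1, 10)
perfect = normalized_accuracy(1.0, 10)
always_wrong = normalized_accuracy(0.0, 5)
```

Under this form, the lower bound depends on the task size, since an accuracy of zero is further below chance for small tasks than for large ones.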
B.2 Subset Choice
For the subset choice setting, we introduce accuracy measures in terms of a choice task and two corresponding choices for . Here, may be thought of as the ground-truth choice for and as a prediction made by a learner. In contrast to the singleton choice setting, these measures do not depend on a utility function. For the sake of convenience, we suppose , and to be arbitrary but fixed in the following. To prepare some of the measures, let us formally define the quantities true positives (), true negatives (), false positives () and false negatives () via
[TABLE]
respectively. These quantities are similar to those used to define the confusion matrix in the case of binary classification [64].
Subset Accuracy
The subset accuracy measures the fraction of times the ground-truth choice set and the predicted choice set are exactly the same, i. e., how often the algorithm's predictions match the complete choice set. Formally, it is defined as
[TABLE]
Recall
Recall is defined as the proportion of real positive cases that are correctly predicted positive [87]. In the field of information retrieval, it is the fraction of the relevant documents that are successfully retrieved. For our choice setting, this can be defined as the fraction of objects from the ground-truth choice set that are chosen successfully, that is, are present in the predicted choice set . Formally, it is defined as
[TABLE]
Precision
Precision denotes the proportion of predicted positive labels that are correct [87]. For the choice setting, this can be defined as the fraction of objects from the predicted choice set that are actually chosen by the decision maker or that are present in the ground-truth choice set . Formally, it is defined as:
[TABLE]
-Measure
The -measure is defined as the harmonic mean of precision and recall:
[TABLE]
It can also be expressed in form of the confusion matrix quantities as follows [64]:
[TABLE]
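The subset choice measures introduced so far can be sketched for a single choice task as follows. The sets are represented by Python sets of object identifiers, the function names are illustrative, and returning 0 when a denominator is empty is a convention assumed here, not one stated in the text.

```python
def confusion_counts(task, truth, pred):
    """True positives, true negatives, false positives and false negatives for one task."""
    task, truth, pred = set(task), set(truth), set(pred)
    tp = len(truth & pred)
    fp = len(pred - truth)
    fn = len(truth - pred)
    tn = len(task - truth - pred)
    return tp, tn, fp, fn


def subset_metrics(task, truth, pred):
    tp, tn, fp, fn = confusion_counts(task, truth, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Confusion-matrix form of the harmonic mean of precision and recall.
    f1 = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    return {
        "subset_accuracy": float(set(truth) == set(pred)),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = subset_metrics(task=range(6), truth={0, 1, 2}, pred={1, 2, 3})
```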
Informedness
The informedness is a measure proposed by [88, 87], which is, in contrast to the -measure, unbiased with respect to the population prevalence of positives. It specifies the probability that the learner makes an informed prediction if compared to chance and is formally defined as
[TABLE]
A very desirable property of this measure is that it is exactly [math] in case the learner is guessing or is constant.
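For illustration, informedness can be computed from the confusion matrix quantities in its usual form, true positive rate plus true negative rate minus one. The function name and the handling of empty denominators are assumptions of this sketch.

```python
def informedness(tp, tn, fp, fn):
    """True positive rate plus true negative rate minus one."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr + tnr - 1.0


all_positive = informedness(tp=3, tn=0, fp=3, fn=0)  # constant predictor
perfect = informedness(tp=3, tn=3, fp=0, fn=0)
```

A predictor that always chooses every object has a true positive rate of one and a true negative rate of zero, so the two terms cancel.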
AUC-ROC
The AUC-ROC is a performance measure, which estimates the capacity of a classification model to distinguish between two classes [35, 74]. It computes the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one [74]. It is estimated by computing the area under the ROC-curve, which is created by plotting the true positive rate against the false positive rate , where
[TABLE]
A very desirable property of this measure is that it is exactly $0.5$ in case the learner is guessing.
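The probabilistic interpretation can be turned into a direct, if quadratic, estimate by comparing every positive object with every negative one. This is a sketch of that pairwise formulation, not of the ROC-curve integration; the function name is illustrative, and ties are counted as half a win, which is a common convention assumed here.

```python
def auc_roc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
constant = auc_roc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The function requires at least one positive and one negative object; otherwise the denominator is zero.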
Appendix C Additional Experimental Details
In this section, we list all experimental details that were excluded from the main paper for the sake of conciseness. First, we explain the process of nested cross-validation with hyperparameter optimization in detail. Then we explain which hyperparameters were tuned for the different models and which parameters were kept fixed. Lastly, we explain the design of the generalization experiment.
Empirical Comparison
In order to compare all learners fairly, we do nested cross-validation with synchronized random streams for all the learning models, as shown in Figure 9. The hyperparameters of all models are tuned using extensive Bayesian optimization. We describe the complete procedure in two parts: first the hyperparameter optimization and second the out-of-sample evaluation. First, we configure the given learner with the default parameters described in the next section. Then we generate sets of training and test datasets; the process used to generate a train-test set for is described in Table 1.
Hyperparameter Optimization
The training set is used to first identify the best hyperparameters using -fold stratified cross-validation, and then to train the final learner for out-of-sample evaluation. The hyperparameter optimizer picks hyperparameters from the ranges in Table 3 () for the iteration. In the inner loop , we split the full training dataset into train set ( of ) and validation dataset ( of ) using the stratified shuffle split. For the given hyperparameters , we train the model on the train set () and evaluate on the validation dataset using the target loss function. We use the -measure for general subset choice and the 1-categorical accuracy for singleton choice as the target loss to evaluate the hyperparameter configuration. We calculate the mean loss for the given hyperparameters . The optimization loop is run for iterations to validate sets of hyperparameters, in order to acquire the optimal parameters for the given learning model.
Out-of-Sample Evaluation
Finally, after optimization, we configure the learners using the best found hyperparameters and the remaining default parameters . Then, we train the model on the complete training dataset and evaluate on the test dataset using different evaluation measures defined in Appendix B. To obtain a good estimate of the mean performance and an estimate for the standard deviation, we repeat this procedure times using outer cross-validation. For each fold , we get the evaluated value and calculate the mean and the standard deviation of the performance measure .
Hyperparameters & Inference
We will now describe the specific hyperparameters we optimize and which ranges of values we consider (see Table 3 for an overview). For probabilistic models, we also describe how the inference is done. For all neural network models, we make use of the following techniques:
- We use either rectified linear units (ReLU) non-linearities in conjunction with batch normalization (BN) [53] or self-normalizing linear units (SELU) non-linearities [62] for each hidden layer.
- Regularization: penalties are applied and the corresponding regularization strength is tuned.
- Optimizer: stochastic gradient descent (SGD) with Nesterov momentum [79].
- A step-decay function is used for the learning rate annealing schedule. The decay factor is tuned [31].
The step-decay function drops the learning rate by a factor after a certain number of epochs [31]. Formally, it is defined as:
[TABLE]
where is the initial learning rate, is the rate with which the learning rate should be reduced, is the current epoch and is the number of epochs after which the learning rate is decreased. We set the maximum number of epochs the neural networks are trained for to .
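A minimal sketch of such a schedule is shown below, assuming the common form in which the initial learning rate is multiplied by the drop rate once for every completed block of epochs. The parameter names `drop_rate` and `epochs_drop` are illustrative stand-ins for the quantities described above.

```python
import math


def step_decay(initial_lr, drop_rate, epoch, epochs_drop):
    """Learning rate after `epoch` epochs, reduced every `epochs_drop` epochs."""
    return initial_lr * drop_rate ** math.floor(epoch / epochs_drop)


# Learning rate at the start, after one drop, and after two drops.
schedule = [step_decay(0.1, 0.5, e, 10) for e in (0, 10, 25)]
```

Between two drops the learning rate stays constant, which is what distinguishes this schedule from exponential decay applied at every epoch.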
The hyperparameters of each algorithm were tuned using the package scikit-optimize [49]. Apart from the number of hidden layers and units, we also tune the learning rate of the stochastic gradient descent optimizer, the regularization strength and the batch size (fraction of training examples used for estimating the gradient in one iteration). For the neural networks, we also tune the drop-rate and epoch-drop of the step-decay function used by the stochastic gradient descent optimizer. For PairwiseSVM, we tune the penalty parameter of the error term and the tolerance of the stopping criterion of the optimization algorithm (tol in scikit-learn) [84]. All of the different GEV models are implemented in PyMC3, a library for facilitating Markov Chain Monte Carlo estimation of the posterior distribution [97]. An overview of all the hyperparameters and their admissible ranges is shown in Table 3.
Threshold Tuning
In order to set the threshold for the subset choice models (4), we tune the threshold for all models on a small validation set. Obviously, an optimal value for will depend on the underlying target loss function. Our main target loss is the (micro-averaged) -measure (21), which balances precision and recall of the predictions [66, 124, 121]. [64] show that tuning a threshold on a validation set yields a consistent classifier if the estimated marginal instance probabilities (in our case the choice probabilities) converge in probability to the population-level probabilities. One important difference to the multi-label classification setting is the absence of a fixed set of labels. Instead, we have a dynamically changing set of objects. Thus, it only makes sense to consider micro-averaged performance metrics.
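The tuning step can be sketched as a simple search over candidate thresholds on the validation set. This is an illustration under stated assumptions: an object is chosen when its score strictly exceeds the threshold, the candidates come from a fixed grid, and the function names are hypothetical. The counts are pooled over all choice tasks before the measure is computed, which is what micro-averaging means here.

```python
def micro_f1(score_lists, truth_sets, threshold):
    """Micro-averaged F-measure: pool the counts over all tasks, then combine."""
    tp = fp = fn = 0
    for scores, truth in zip(score_lists, truth_sets):
        for i, s in enumerate(scores):
            chosen = s > threshold
            if chosen and i in truth:
                tp += 1
            elif chosen:
                fp += 1
            elif i in truth:
                fn += 1
    return 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0


def tune_threshold(score_lists, truth_sets, candidates):
    """Return the candidate threshold with the highest validation score."""
    return max(candidates, key=lambda t: micro_f1(score_lists, truth_sets, t))


val_scores = [[0.9, 0.6, 0.2], [0.8, 0.4, 0.1]]
val_truth = [{0, 1}, {0}]
best = tune_threshold(val_scores, val_truth, [0.1, 0.3, 0.5, 0.7])
```

Because the tasks may contain different numbers of objects, pooling the counts avoids giving small tasks the same weight as large ones.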
Appendix D Design of the Generalization Experiment
The second experimental setup is designed to gauge the generalization capability of the learning models by measuring the accuracy obtained by a trained model on unseen task set sizes. To this end, we vary the task set sizes from to as shown in Figure 10.
First, we configure the learning model with the best hyperparameters obtained from the empirical comparison experiment for the given dataset and the remaining default parameters . Then we generate the training dataset containing task sets of size and train the configured model on the training dataset .
Finally, we evaluate the trained model on different test datasets containing the task sets of sizes in () as described in Table 4.
Appendix E Synthetic Datasets
In this section, we will formally describe the process of generating the datasets for the experimental evaluation. In the case of synthetic datasets, this entails the complete process by which the objects and queries are generated.
E.1 The Medoid Problem
Recall that we have defined the medoid of a set as , where is the standard Euclidean norm in . Thus, the medoid of may be thought of as the most centrally located object in , cf. the illustration of a choice set of size and its medoid in Figure 11(a). As it depends on its distance to any other point from , the medoid of is sensitive to changes of any points in .
For our empirical study, we created a dataset by drawing each independently and uniformly at random from the set
[TABLE]
and then choose . Here, the sampling step can be performed via the acceptance-rejection method: One may repeatedly sample uniformly at random from until has size and a unique medoid. Since this condition is already fulfilled with probability after sampling only once, this method is efficient.
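The sampling procedure can be sketched as follows. Several details are assumptions made purely for illustration: the points are drawn from the unit hypercube, the medoid is taken to be the point minimizing the sum of Euclidean distances to all points in the set, and ties are detected by exact floating-point equality. The function names are hypothetical.

```python
import math
import random


def medoid_indices(points):
    """Indices of all points that minimize the summed distance to the set."""
    totals = [sum(math.dist(p, q) for q in points) for p in points]
    best = min(totals)
    return [i for i, t in enumerate(totals) if t == best]


def sample_medoid_task(n, dim=2, rng=random):
    """Acceptance-rejection: resample until there are n distinct points and a unique medoid."""
    while True:
        points = [tuple(rng.random() for _ in range(dim)) for _ in range(n)]
        winners = medoid_indices(points)
        if len(set(points)) == n and len(winners) == 1:
            return points, winners[0]


task, choice = sample_medoid_task(5, rng=random.Random(0))
```

With continuous uniform sampling, duplicates and ties are extremely unlikely, so the loop almost always terminates after its first iteration.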
E.2 The Pareto Problem
Above, we introduced the Pareto set of a set as the set of all elements which are not dominated by any , wherein was said to dominate if and . Figure 11(b) shows the Pareto set of a set .
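As an illustration, the Pareto set of a small point set can be computed by checking every pair of points for dominance. The sketch assumes the usual maximization convention, in which a point dominates another if it is at least as large in every coordinate and strictly larger in at least one; the direction of the comparison is an assumption here, and the function names are illustrative.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_set(points):
    """All points that are not dominated by any other point in the set."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


front = pareto_set([(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)])
```

The quadratic pairwise check is sufficient for choice tasks of the sizes considered here; faster algorithms exist for large point sets.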
With the help of Pareto sets we create a synthetic dataset for the subset choice task, where each sample is generated independently of the others in the following way:
1. Sample i.i.d. uniformly at random from
2. Draw i.i.d. samples from , the standard Gaussian distribution on , and define for each .
3. Choose and .
Hypervolume
In Section 6.3.3 we have introduced for the choice set as the set of all , which contribute the least among all elements in to the hypervolume of , cf. Section 6.3.3 for the precise definitions and also for the connection of the hypervolume of to the Pareto front of . As this contribution of each point depends on the position of other points in , is context-dependent. This is illustrated in Figure 11(b), where all five elements of lie on the Pareto front of . There, the contribution of point is largest in , but if we remove the point from the choice set, it increases the contribution of the point for the set. So, the singleton choice changes from to , after removing from .
Based on we construct a singleton choice dataset by sampling each uniformly at random from the set of all , which fulfill
[TABLE]
and then defining afterwards. As in the construction of the Medoid dataset, sampling can be done via the acceptance-rejection method.
E.3 MNIST Number Problems
In this section, we will describe the process of generating different semisynthetic datasets using the MNIST dataset [65].
Feature Extraction
Since the dataset consists of -D image maps, we first train an off-the-shelf CNN to solve the digit multi-class classification task to level the playing field and abstract away from the computer vision context. The architecture of the CNN consists of -D convolutional, -D max-pooling, and fully-connected dense layers, and applies batch normalization to increase the stability of the network by subtracting the batch mean and dividing by the batch standard deviation, as shown in Figure 12 [43, 53].
The -D convolutional layer has kernel size and uses the rectified linear unit (ReLU) activation function with l- regularization. The -D max-pooling layer uses a filter of size applied with a stride of , which down-samples the input by along the width and height, discarding of the activations by applying the max operation over numbers in each region [43]. The output of these layers is provided as input to a fully-connected sequential network with outputs, where each output predicts the probability of the input image belonging to a particular class using the softmax [43]. We train this network on instances, then we transform the remaining digits to a high-level feature representation by passing them through the trained CNN and recording the outputs of the last hidden layer (D2).
The transformed MNIST dataset , is represented as a set of tuples , where is the feature vector and represents the corresponding label, such that , , and holds for all . For constructing the choice datasets, we sample instances from the transformed dataset uniformly at random, to construct a task set . Based on and , we then select as choice set , where is an appropriately predefined function . We consider two variants for , namely and .
The function outputs the instances corresponding to the numbers which occur only once in the label vector. For example
, corresponding to the numbers , and . For singleton choice, we sample only the task sets whose corresponding label vector contains a single unique number, to make it identifiable, i. e., for example . The second function is , which outputs the instances corresponding to the number which occurs most frequently in the label vector. For example , corresponding to the mode. For singleton choice, we choose the instance corresponding to the mode which is at the least angle from a predefined weight vector .
Both functions used to generate choices depend on all other objects in the given task set , thus making the datasets highly context-dependent.
Unique
In this subsection, we explain the data generation process for the Unique choice dataset using the function defined above. To generate the dataset, we select a set of instances from uniformly at random to construct the task set and the label vector . Then we choose the objects from that correspond to the unique digits in the label vector (an example is shown in Figure 11(c)). Let us assume we want to generate a dataset with instances.
1. Sample data points from , let and .
2. For each let be the number of times the label appears in the label vector for , define and write for convenience in the following. For example for we have .
3. We create by selecting the objects whose values occur only once in the label vector :
[TABLE]
4. In order to create the corresponding singleton choice or top- version of this dataset, we discard in case and repeat steps 1–4. If instead, we keep the sample .
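The steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the dataset layout (a list of `(feature_vector, label)` tuples), sampling without replacement, and the `max_tries` guard are assumptions. In the subset variant the sketch keeps every sampled task set, including those with an empty choice set, which the text does not specify.

```python
import random
from collections import Counter


def unique_choice(labels):
    """Indices of the objects whose label occurs exactly once in the label vector."""
    counts = Counter(labels)
    return [i for i, y in enumerate(labels) if counts[y] == 1]


def sample_unique_instance(dataset, n_objects, rng, singleton=False, max_tries=10_000):
    """Draw one (task set, label vector, chosen indices) triple.

    dataset: list of (feature_vector, label) tuples (hypothetical layout).
    singleton: if True, reject task sets that do not have exactly one unique label.
    """
    for _ in range(max_tries):
        task = rng.sample(dataset, n_objects)  # assumption: without replacement
        labels = [y for _, y in task]
        chosen = unique_choice(labels)
        if not singleton or len(chosen) == 1:
            return [x for x, _ in task], labels, chosen
    raise RuntimeError("no admissible task set found")
```

For the label vector `[3, 9, 3, 7, 7, 1]`, `unique_choice` returns the positions of the labels 9 and 1, since every other label occurs twice.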
Mode
In this subsection, we explain the data generation process for the Mode choice dataset using the function defined above. To generate the dataset, we select a set of instances from uniformly at random to construct the task set and the label vector (an example is shown in Figure 11(c)). Then we choose the objects from that correspond to the mode value of the label vector to construct the ground-truth set of chosen objects. To create the corresponding singleton choice or top- dataset, we choose the object corresponding to the mode value of the label vector that is at the least angle to the predefined weight vector . Let us assume we want to generate a dataset with instances. First, we sample the weight vector .
1. Sample data points uniformly at random from , abbreviate and let .
2. As for the Unique dataset, write for the number of times the label appears in the label vector for , define and write again .
3. For the case of subset choice define
[TABLE]
and in case of singleton choice, select to be the set that contains only the object with the least angle to vector , i. e.,
[TABLE]
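A minimal sketch of the Mode choice rule described above, under stated assumptions: when several labels are tied for the highest frequency, the sketch keeps the objects of all tied labels (the text does not say how ties are handled), and feature vectors are assumed to be non-zero so that the angle is defined. The function names are hypothetical.

```python
import math
from collections import Counter


def mode_choice(labels):
    """Indices of the objects whose label is the most frequent one (ties kept, by assumption)."""
    counts = Counter(labels)
    top = max(counts.values())
    modes = {y for y, c in counts.items() if c == top}
    return [i for i, y in enumerate(labels) if y in modes]


def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding errors
    return math.acos(cos)


def mode_singleton_choice(features, labels, weight):
    """Index of the mode object with the least angle to the weight vector."""
    return min(mode_choice(labels), key=lambda i: angle(features[i], weight))
```

Because the chosen object depends on which label is most frequent in the whole task set, changing a single other object can change the choice.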
E.4 Tag Genome Dataset
The GroupLens Research group released many datasets collected from the MovieLens website (https://movielens.org/) for research in the field of recommender systems [47]. As of August 2017, the full dataset collected from this website consists of ratings and tags applied to movies by users [47]. One of the datasets is the Tag Genome dataset (available at https://grouplens.org/datasets/movielens/), which provides real-valued features to characterize the movies [117].
Tags are meta-data in the form of keywords, which help to describe an object (such as movie, music, books). In recent years tagging has gained popularity due to the growth of social networking websites and web search engines [105]. On the MovieLens website, users create tags to describe a movie. Other users can then use them to filter movies more effectively. Users can also gain more information about a movie with the help of tags applied by other users.
The Tag Genome dataset was generated by applying machine learning algorithms on the information provided by users for a movie in the form of tags, reviews, and ratings [118]. It consists of movies and a set of tags applied to each of them, and a score between [math] and quantifying the relevance of each tag to the particular movie (as shown in Figure 13). Currently, this dataset consists of around 12 million relevance scores across tags applied on movies.
Framework
According to [117] the Tag Genome dataset consists of:
1. : The set of movies , where .
2. : The set of tags , where .
3. : Relation such that denotes the degree to which the tag applies to the movie on a scale of [math] to ; here [math] indicates no relevance and indicates strong relevance to the movie (as shown in Figure 13).
4. : Relation mapping each movie to its feature vector in tag-space (vector of tag relevance values across all tags), such that .
5. : Function representing the popularity of a tag, measured as the number of users who applied the tag .
6. : Function representing the movie frequency of tag , i. e., denotes the number of movies for which the relevance of tag is greater than .
7. : The set of top most popular tags based on the popularity tag-pop.
The weighted cosine similarity is a similarity measure defined in [117] to measure the similarity between two movies. The weight vector is defined such that more weight is assigned both to popular tags, because more users care about these tags, and to more specific tags, because they identify similarity more precisely. For example, if two movies have the harry potter tag in common, they are more likely to be similar than two movies that have the tag fantasy in common [117]. A -transform is applied to both values to bring them closer to the normal distribution. The weighted cosine similarity between two movies is defined as:
[TABLE]
where and for any .
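As an illustration, a weighted cosine similarity in its standard form can be sketched as follows. The tag weights are passed in as a plain list: how they are derived from tag popularity and movie frequency follows [117] and is not reproduced here, so the weight values in any call are assumptions. Returning 0.0 for an all-zero vector is also an assumption of this sketch.

```python
import math


def weighted_cosine(x, y, weights):
    """Standard weighted cosine similarity between two tag-relevance vectors.

    x, y: relevance vectors of two movies over the same tags.
    weights: one non-negative weight per tag (supplied by the caller).
    """
    numerator = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x = math.sqrt(sum(w * a * a for w, a in zip(weights, x)))
    norm_y = math.sqrt(sum(w * b * b for w, b in zip(weights, y)))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0  # assumed convention for vectors without any relevant tag
    return numerator / (norm_x * norm_y)
```

With larger weights on specific tags, two movies that agree on such a tag receive a higher similarity than two movies that agree only on a generic one.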
To construct the singleton choice semisynthetic dataset, we sample uniformly at random movie items from to create a task set , and we choose the medoid of as the reference movie.
We define two tasks based on the reference movie of the sampled task set . The first task is to choose the most similar movie to the reference movie in task set . The second task is to choose the most dissimilar movie with respect to the reference movie for a given task set . This problem is similar to finding the outliers for a given set of objects which can be used to solve the problem of anomaly detection [1, 21]. Both tasks used to generate semisynthetic datasets depend on the similarity between all objects in the given task set , thus making the datasets highly context-dependent.
Data Generation Process
We explain the data generation process for the Tag Genome Similar Movie and Tag Genome Dissimilar Movie datasets. Let us assume we want to generate a singleton choice dataset with instances. Each task set and its corresponding singleton choice are constructed in the following way:
1. Sample i.i.d. and uniformly at random from , let for each and .
2. Compute the reference object (movie) for (medoid):
[TABLE]
3. Now we define the corresponding singleton choices for the Tag Genome Similar Movie and Tag Genome Dissimilar Movie datasets.
   (a) The singleton choice set for for Tag Genome Dissimilar Movie is the set consisting of only that element of that is most dissimilar to , i. e., formally
   [TABLE]
   (b) For the Tag Genome Similar Movie dataset, we select for the task the singleton choice set
   [TABLE]
   which consists of the one element from that is most similar to .
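The procedure above can be sketched as follows. Several details are assumptions of this sketch rather than statements from the paper: the medoid is taken to be the object with the largest total similarity to all other objects in the task set, the reference movie itself is excluded from both choices, and an unweighted cosine is used as a stand-in for the weighted similarity (any similarity function can be passed via the hypothetical `sim` parameter).

```python
import math


def cosine(x, y):
    """Plain cosine similarity, used here only as a stand-in similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def medoid_index(task, sim=cosine):
    """Index of the object with the largest summed similarity to the others."""
    n = len(task)
    return max(range(n), key=lambda i: sum(sim(task[i], task[j]) for j in range(n) if j != i))


def similar_and_dissimilar(task, sim=cosine):
    """Return (reference index, most similar index, most dissimilar index)."""
    ref = medoid_index(task, sim)
    others = [i for i in range(len(task)) if i != ref]  # assumption: reference excluded
    most_similar = max(others, key=lambda i: sim(task[i], task[ref]))
    most_dissimilar = min(others, key=lambda i: sim(task[i], task[ref]))
    return ref, most_similar, most_dissimilar
```

Since the reference movie is recomputed for every task set, the same movie can be the correct choice in one task set and not in another.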
Appendix F Real-World Datasets
Some widely used benchmark-datasets available for solving this task are LETOR and SUSHI [89, 58]. In the following sections, we briefly describe these datasets and the process we use to generate singleton and subset choice datasets.
F.1 LETOR Datasets
LETOR (version 4.0) is a package of benchmark datasets released by Microsoft Research Asia, which are used to compare and evaluate different learning algorithms in the field of preference learning [89]. We use the datasets MQ and MQ, released for learning the task of partial ranking, to create the subset choice dataset. The datasets MQ-list and MQ-list, released for learning the task of complete ranking, are used to create the singleton choice dataset. These datasets are available at https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/.
LETOR Supervised Datasets
The datasets (MQ and MQ) consist of the queries and retrieved documents, with individual preferences in the form of a relevance for each document with respect to the corresponding query [89]. The format of both datasets (MQ and MQ) is the same, and there are about queries in MQ and about in MQ with labelled documents. These datasets consist of features extracted from a query and document constructing an object called query-document and each pair is labelled with a relevance score in , indicating how relevant the document is to the respective query as shown in Figure 14(a). A relevance score of [math] means that the document is not relevant, means relevant and means very relevant to the query. For this dataset, the goal of the choice problem is to choose all the relevant documents for the given task.
Structure
The dataset consists of a universal set of objects . Each instance of these datasets , is represented as set of tuples , where is the task set ( features extracted from query-document) and represents vector of relevance label for the given set of objects, such that for all and for every .
The size of the universal set of objects in the MQ dataset is , i. e., and the MQ dataset is , i. e., . These datasets have been partitioned into parts by [89], such that . This partition is used to conduct -fold cross-validation, and for each fold, we use four parts for training and the remaining part for testing as described in Table 5.
Choice Data Conversion
The corresponding choice dataset is created by considering the documents in as the task sets and the set of relevant documents as the corresponding choice set for each instance . For training the choice model, we sub-sample objects from each query instance to construct the task sets. Note that we still evaluate the models on the corresponding test choice dataset, which consists of all original queries for each fold as described in Table 5.
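A hedged sketch of this conversion is given below. The threshold `min_relevance=1` encodes the assumption that every document with a non-zero relevance label counts as relevant, and recomputing the choice set on the sub-sampled objects is likewise an assumption; neither the function names nor the sampling details are taken from the paper.

```python
import random


def relevant_indices(labels, min_relevance=1):
    """Indices of documents treated as chosen (assumed: label >= min_relevance)."""
    return [j for j, label in enumerate(labels) if label >= min_relevance]


def subsample_query(docs, labels, n_objects, rng):
    """Sub-sample a query to at most n_objects documents and recompute its choice set."""
    n = min(n_objects, len(docs))
    idx = sorted(rng.sample(range(len(docs)), n))  # without replacement
    sub_docs = [docs[j] for j in idx]
    sub_labels = [labels[j] for j in idx]
    return sub_docs, relevant_indices(sub_labels)
```

The returned indices refer to positions in the sub-sampled task set, not in the original query.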
LETOR Listwise Datasets
The format of both listwise datasets is the same as the supervised one. There are about queries in MQ-list and about queries in MQ-list, with each query-document pair consisting of features. In these datasets, all documents for each query are labelled with a real-valued relevance score instead of the multi-level relevance judgments, as shown in Figure 14(b). Documents at the top positions of the ground-truth permutation have larger relevance scores.
Structure
The dataset consists of a universal set of objects . Each instance of these datasets , is represented as a set of tuples , where is the task set ( features extracted from query-document) and represents a vector of relevance score for the given set of objects, such that for all and for every .
Singleton Choice Data Conversion
The corresponding singleton choice datasets are created by considering the documents in as the task sets and the most relevant document as the corresponding singleton choice set for each instance . For training the SCM we sub-sample objects from each query instance to construct the task sets. Note that we still evaluate the models on the corresponding singleton choice test dataset, which consists of all original queries for each fold as described in Table 5.
F.2 Expedia Hotel Dataset
Expedia released a dataset on the Kaggle website as a competition and for research purposes (available at https://www.kaggle.com/c/expedia-personalized-sort/data). The dataset includes browsing and booking data as well as information on price competitiveness. The data are organized around a set of search result impressions, i.e., the ordered lists of hotels that users see after they search for a hotel on the Expedia website. In addition to impressions from the existing algorithm, the dataset contains impressions where the hotels were randomly sorted, to avoid the position bias of the existing algorithm. The user response is provided as a click on a hotel and/or a purchase of a hotel room. This dataset consists of search queries and features extracted from the search query and the hotel constructing an object. Each hotel is labelled with a relevance score of [math], or , indicating how relevant the hotel is to the respective query or the user. A relevance score of [math] means that the hotel is not clicked, means it was clicked and means the hotel was booked by the user. This dataset is very similar to the LETOR dataset as shown in Figure 14. For this dataset, we define the learning target to be the set of relevant hotels (clicked and/or booked). Since the number of hotels displayed differs from query to query, this dataset consists of different task sizes.
Structure
The dataset consists of a universal set of objects . Each instance of the datasets , is represented as a set of tuples , where is the task set ( features extracted from hotel) and represents the vector of relevance label for the given set of objects, such that for each and for all .
The number of instances in this dataset is , i. e., and the size of the universal set of objects (hotels) is , i. e., . There are features which have missing values, and we removed the features which consist of more than missing values. For the remaining features which have of missing values, we impute them with a negative value less than . The models are trained on the resulting dataset with features.
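The feature cleaning described above can be sketched as follows. The concrete threshold and the imputation value are left as parameters because they are not recoverable from the text here; `max_missing_fraction` and `fill_value` are hypothetical names, and missing entries are assumed to be encoded as `None`.

```python
def clean_features(rows, max_missing_fraction, fill_value):
    """Drop features with too many missing values and impute the rest.

    rows: list of equally long feature lists, with None marking a missing value.
    Returns the cleaned rows and the indices of the features that were kept.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    kept = []
    for col in range(n_cols):
        n_missing = sum(1 for row in rows if row[col] is None)
        if n_missing / n_rows <= max_missing_fraction:
            kept.append(col)
    cleaned = [
        [fill_value if row[col] is None else row[col] for col in kept]
        for row in rows
    ]
    return cleaned, kept
```

Choosing a `fill_value` below the smallest observed feature value keeps imputed entries distinguishable from real measurements.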
Data Conversion Process
We create folds by shuffle-splitting the dataset randomly into test and train instances. The choice dataset is created by considering the hotels in as the task set and the set of relevant hotels as the corresponding choice set for each instance . The models are trained on the sampled training dataset and corresponding test dataset using -fold stratified cross-validation as described in Table 7.
Singleton Choice
In order to create the singleton choice dataset, we just consider the samples where the user booked the hotel, which is the singleton choice for the given query. The singleton choice dataset is created by considering the hotels in as the task set and the set of booked hotels $C_{i}=\bigl\{\boldsymbol{x}_{j}\in\tilde{Q}_{i}:l_{j}=2\bigr\}$ as the corresponding choice set for each instance .
The models are trained on the sampled training dataset and evaluated on the corresponding test dataset using -fold stratified cross-validation as described in Table 7. Note that instances without any booking were discarded, so only instances with a booking were considered.
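A minimal sketch of both Expedia conversions follows. The label value 2 for a booking is taken from the choice-set definition above; treating every non-zero label as relevant (clicked and/or booked) is an assumption of this sketch, as are the function names and the `(hotels, labels)` layout of a query.

```python
BOOKED = 2  # label of a booked hotel, as in the choice-set definition above


def subset_choice(labels):
    """Indices of relevant hotels (assumed: any non-zero label, i.e. clicked or booked)."""
    return [j for j, label in enumerate(labels) if label > 0]


def singleton_instances(queries):
    """Keep only queries with a booking and return (hotels, booked indices) pairs.

    queries: list of (hotels, labels) pairs (hypothetical layout).
    """
    instances = []
    for hotels, labels in queries:
        booked = [j for j, label in enumerate(labels) if label == BOOKED]
        if booked:  # queries without any booking are discarded
            instances.append((hotels, booked))
    return instances
```

Queries that only contain clicks still contribute to the subset choice dataset but are dropped from the singleton version.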
F.3 SUSHI Dataset
SUSHI (available at http://www.kamishima.net/sushi/) is another dataset released for solving the task of object ranking. This dataset was collected by surveying individuals, such that each person was provided with two item sets and . Set consists of the most famous sushi and consists of the top sushi famous in Japan. Individuals were asked to provide their preferences in the form of a total order for the items in set , and a real-valued score between [math] and for the sushi in set . There were missing rating values for many items in set , so they extracted the total order for the top preferred items by each user.
The SUSHI dataset consists of a universal set of objects , with size , i. e., , with a set of objects of size , and each sushi consists of features, i. e., . The instances of the dataset , are represented as a set of tuples , where is the set of objects and represents the underlying ordering for the given set of objects , such that , and holds for all .
The dataset contains the following features:
1. Style: This is a binary feature, which describes whether the sushi is a Maki or other, where [math] means Maki sushi and means others.
2. Major Group: This is a binary feature, which describes whether it is listed as a seafood ([math]) or not ().
3. Minor group: Describes the species group used to prepare the sushi. The group is denoted by a categorical value between [math] and , i.e., it lies in the set . Refer to Table 8 for a description of each group.
4. Oiliness/Heaviness: The amount of oil or fat present in the sushi, expressed as a real number between [math] and , where [math] indicates heavy/oily and oil-free.
5. Demand: The frequency with which the user demands the sushi, expressed as a real number between [math] and , where means most frequently and [math] not at all.
6. Normalized Price: The price of sushi normalized over the given sushis.
7. Supply: The frequency of selling a sushi in the shop, expressed as a real number between [math] and , where [math] indicates not at all and frequently.
Singleton Choice Data Conversion
To use the SUSHI dataset in the singleton choice setting, we reuse the set of objects in and choose the most preferred object as the singleton choice. We created the singleton choice dataset with instances, such that and for all . The singleton choice models are evaluated using -folds obtained by train-test shuffle-splits with 80 % train and 20 % test instances.
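The conversion and the evaluation split can be sketched as follows. The encoding of an ordering as a list of rank positions with 0 for the most preferred object is an assumption, as are the function names; only the 80/20 proportion is taken from the text.

```python
import random


def singleton_from_ranks(ranks):
    """Index of the most preferred object, assuming ranks[j] == 0 marks the top object."""
    return min(range(len(ranks)), key=ranks.__getitem__)


def shuffle_split(n_instances, n_folds, rng, train_fraction=0.8):
    """Yield (train indices, test indices) for repeated random train-test splits."""
    n_train = int(round(train_fraction * n_instances))
    for _ in range(n_folds):
        indices = list(range(n_instances))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]
```

Unlike k-fold cross-validation, shuffle-splits are drawn independently, so the test sets of different folds may overlap.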
Appendix G Detailed Experimental Results
The following Tables 9, 10 and 11 contain all experimental results as discussed in Section 6.4 in numeric form for additional evaluation measures.
- [1] Charu C Aggarwal and Philip S Yu “Outlier detection for high dimensional data” In ACM Sigmod Record 30.2, 2001, pp. 37–46 ACM
- [2] Qingyao Ai, Keping Bi, Jiafeng Guo and W. Croft “Learning a Deep Listwise Context Model for Ranking Refinement” In SIGIR ACM, 2018, pp. 135–144
- [3] Qingyao Ai et al. “Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks” In ICTIR ACM, 2019, pp. 85–92
- [4] Pavel Anselmo Alvarez, Alessio Ishizaka and Luis Martínez “Multiple-criteria decision-making sorting methods: A survey” In Expert Systems with Applications 183, 2021, pp. 115368
- [5] Attila Ambrus and Kareen Rozen “Rationalising Choice with Multi‐Self Models” In The Economic Journal 125.585, 2014, pp. 1136–1156 DOI: 10.1111/ecoj.12103
- [6] Kenneth J Arrow “Social Choice and Individual Values” John Wiley & Sons, 1951
- [7] Richard R. Batsell and John C. Polking “A New Class of Market Share Models” In Marketing Science 4.3 INFORMS, 1985, pp. 177–198 URL: http://www.jstor.org/stable/183903
- [8] Peter W. Battaglia et al. “Relational inductive biases, deep learning, and graph networks” In CoRR abs/1806.01261, 2018
