Dynamic Learning with Frequent New Product Launches: A Sequential Multinomial Logit Bandit Problem
Junyu Cao, Wei Sun

TL;DR
This paper introduces a sequential multinomial logit bandit model for online learning in dynamic markets with frequent new product launches, providing algorithms with regret bounds and strategies to mitigate risks in learning new products.
Contribution
It develops a novel SMNL model for tiered recommendations, proposes polynomial-time algorithms for offline and online learning, and extends the model to include learning accuracy constraints for new products.
Findings
Proposed a polynomial-time algorithm for offline preference learning.
Designed an online learning algorithm with quantifiable regret bounds.
Extended the model to ensure learning accuracy for new products.
Abstract
Motivated by the phenomenon that companies introduce new products to keep abreast with customers’ rapidly changing tastes, we consider a novel online learning setting where a profit-maximizing seller needs to learn customers’ preferences through offering recommendations, which may contain existing products and new products that are launched in the middle of a selling period. We propose a sequential multinomial logit (SMNL) model to characterize customers’ behavior when product recommendations are presented in tiers. For the offline version with known customers’ preferences, we propose a polynomial-time algorithm and characterize the properties of the optimal tiered product recommendation. For the online problem, we propose a learning algorithm and quantify its regret bound. Moreover, we extend the setting to incorporate a constraint which ensures every new product is learned to a given accuracy. Our results demonstrate the tier structure can be used to mitigate the risks associated with learning new products.
Keywords: sequential, multinomial logit model, bandit, new product, dynamic
1 Introduction
Facing increasingly savvy customers whose preferences are rapidly changing, companies that choose to play it safe by remaining with traditional product lines risk being overtaken by competitors more in tune with their customers. A coping strategy adopted by companies is to frequently launch new products and learn from the market responses. Between the tried-and-true existing products and new products with little or no history, companies face a dilemma: they have to offer new products in order to understand the changing market dynamics and improve longer-term profitability, yet doing so may sacrifice short-term profitability. The central question is, how can a company quickly learn customers’ preferences while mitigating the risks inherent in new products?
We approach this question as an online learning task. We consider a seller whose goal is to maximize cumulative profit over a selling horizon $T$. She will introduce several new products at different times during the selling period. For every customer, the seller determines some products to offer (we use “recommend” and “offer” interchangeably in this work), which may include the existing and/or new products. Based on the customer’s response, the seller updates her belief about the latent customers’ preferences (also known as product valuations), and uses the information to optimize the product selection for the next customer. As we will show in the paper, many new products with relatively low profit will never be offered under a pure profit-maximizing objective.
In reality, companies often intentionally price new products low to gain exposure and to entice customers to give them a try. Thus, many new products may have relatively low profits, yet learning from these products is crucial for understanding customers’ preferences and enabling companies to make better business decisions in the future. To model such behavior, we impose a constraint, termed “minimum learning criterion”, which requires every new product to be offered and learned to a given accuracy. A direct implication is that the seller will bear an additional cost of learning, as she makes less money from these products. It is natural to ask what can be done to reduce such cost.
We will show that a judicious choice of presenting products is capable of mitigating some costs associated with learning new products. In our setting, products are presented in tiers, indicating the precedence in which customers discover them. For a given customer, a seller first offers the products on the first tier. If none are selected, the seller then presents the second tier, and so on. Priorities are embedded in tiers as product placement affects product visibility. Such product offerings are ubiquitous in the online marketplace. Examples include companies sending multiple emails or app notifications to promote products, and websites where products are displayed over multiple pages and customers have to take an action (such as clicking on “Next” or “Load more”) to access the next set of products.
To capture the customer’s responses when recommendations come in tiers, we propose a sequential multinomial logit (SMNL) model, which generalizes the multinomial logit (MNL) model that has been extensively studied in the literature (e.g., Agrawal et al., 2017a; Talluri and Van Ryzin, 2004; Train, 2009). Besides offering priorities in which products are shown, we will prove additional benefits of tiered product recommendation: i) it is capable of achieving higher profit than displaying all products at once, and ii) it reduces the profit risks associated with new products.
In this paper, we refer to the online task of learning customers’ preferences through tier-based product recommendations as the SMNL Bandit problem. The contribution of our work is threefold:
1. We propose a novel SMNL model to capture customers’ sequential choice behavior. For the offline problem with known customers’ preferences, we provide a polynomial-time algorithm to solve the profit-maximization problem, and characterize the properties of the optimal tiered product offering.
2. For an online setting where new products are frequently launched at different times during the selling horizon, we propose an online learning algorithm for the SMNL Bandit problem, and characterize its regret bound.
3. We extend the online setting to incorporate a constraint which ensures all new products are learned to a given accuracy, and demonstrate how the tier structure in product presentation can be exploited to mitigate risks with new products.
2 Literature review
The first stream of work that our paper is related to is assortment optimization. It refers to the problem of selecting a set of products to offer to a group of customers so as to maximize the revenue when customers make purchases according to their preferences. It is a central topic in economics, marketing, and the operations management research literature. We refer the reader to Kök et al. (2008) for a comprehensive review. Talluri and Van Ryzin (2004) is the first paper that models customers’ preferences with the MNL model for the assortment planning problem. Flores et al. (2018) study the assortment optimization problem with a different sequential choice model known as the perception-adjusted Luce model and characterize the optimal assortment for the offline problem. Besides modelling customers’ preferences differently, we also study the problem in the online setting and investigate the learning policy with new products.
Another related topic is the multi-armed bandit (MAB) problem (e.g., Robbins, 1985; Sutton et al., 1998). Our problem falls under the combinatorial setting (Chen et al., 2013) since the retailer’s decision is a combination of different products. A naive approach is to treat each possible combination as an arm. However, with this approach the number of arms increases exponentially with the number of products. Other combinatorial bandit work assuming linear rewards (Auer, 2002; Rusmevichientong and Tsitsiklis, 2010) or independent rewards (Chen et al., 2013) cannot be directly applied to our model. Recent work on assortment optimization (such as Cheung and Simchi-Levi, 2017; Agrawal et al., 2017a; Agrawal et al., 2017b; Sauré and Zeevi, 2013; Rusmevichientong et al., 2010) extends the MNL assortment problem from the offline setting to the online setting, where customers’ preferences are unknown a priori and need to be learned. Our work is most closely related to Agrawal et al. (2017a), but with the following key differences. Firstly, we consider multi-tiered assortments. Despite their ubiquity in practice, there is little formal analysis in the literature on either the offline optimization problem or the online learning algorithms. Our work helps to bridge this gap. Secondly, we focus on learning in conjunction with new product launches, where we differentiate two cases depending on whether all new products need to be learned.
3 Problem formulation
In this section, we formally set up our problem. We first introduce the SMNL model, which describes the customers’ behavior, and then formulate the profit maximization problem that the seller needs to solve.
3.1 Customer’s behavior: SMNL model
Discrete choice models such as the popular MNL model are derived under the assumption that a utility-maximizing customer chooses the product with the highest valuation among an available choice set $S$ (Train, 2009). In an SMNL model, $S$ consists of multiple tiers of products. For ease of notation, we will present a two-tier model where the choice set consists of two sets, i.e., $S = (S_1, S_2)$. We will refer to $S_1$ and $S_2$ as the priority tier and the secondary tier respectively, as products in $S_1$ enjoy greater visibility. Note that all our results can be generalized to incorporate more tiers.
Customers arrive at discrete times $t = 1, \dots, T$. A customer arriving at time $t$ is presented with a choice set $S^t = (S_1^t, S_2^t)$ that is selected by the seller. Under the SMNL model, a customer first considers products from the priority tier $S_1^t$. If none are selected, she will then consider the secondary tier $S_2^t$ and decide whether to select any product from $S_2^t$. Note that no-purchase is also one of the choices that the customer can make. The probability that a customer purchases product $i$ is denoted as $p_i(S)$ and no-purchase as $p_0(S)$, i.e.,
$$
p_i(S) =
\begin{cases}
\dfrac{v_i}{1 + \sum_{j \in S_1} v_j}, & i \in S_1, \\[2ex]
\dfrac{1}{1 + \sum_{j \in S_1} v_j} \cdot \dfrac{v_i}{1 + \sum_{j \in S_2} v_j}, & i \in S_2,
\end{cases}
\qquad
p_0(S) = \dfrac{1}{1 + \sum_{j \in S_1} v_j} \cdot \dfrac{1}{1 + \sum_{j \in S_2} v_j},
$$
where $v_i$ is the product valuation or customers’ preference for product $i$, which is assumed to be less than 1. For a product from $S_1$, its purchase probability follows that of a standard MNL model. On the other hand, the probability of purchasing a product from $S_2$ is the joint probability of two events, i.e., the customer has not selected any product from $S_1$ and the customer selects a product from $S_2$.
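As an illustration of the choice model just described, the following minimal Python sketch computes two-tier purchase probabilities. It assumes the form stated in the text: a standard MNL choice within the priority tier and, for the secondary tier, the probability of selecting nothing from the priority tier multiplied by an MNL choice within the secondary tier. The function name, the dictionary layout, and the valuations are hypothetical and chosen only for illustration.

```python
def smnl_choice_probabilities(tier1, tier2, v):
    """Purchase probabilities for a two-tier offering.

    Returns a dict keyed by (tier, product), plus the key "no_purchase".
    `v` maps each product to its valuation (assumed positive).
    """
    total1 = sum(v[i] for i in tier1)
    total2 = sum(v[i] for i in tier2)
    # Probability that the customer selects nothing from the priority tier.
    skip_tier1 = 1.0 / (1.0 + total1)
    probs = {}
    for i in tier1:
        probs[(1, i)] = v[i] / (1.0 + total1)
    for i in tier2:
        probs[(2, i)] = skip_tier1 * v[i] / (1.0 + total2)
    probs["no_purchase"] = skip_tier1 / (1.0 + total2)
    return probs


# Hypothetical valuations, not taken from the paper.
valuations = {"a": 0.5, "b": 0.25, "c": 0.5}
probs = smnl_choice_probabilities(["a", "b"], ["c"], valuations)
for outcome, p in probs.items():
    print(outcome, round(p, 4))
```

The probabilities of all purchase outcomes and the no-purchase outcome sum to one, which is a quick sanity check on any implementation of the model.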
3.2 Seller’s profit maximization problem
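Knowing the customers’ purchase probabilities $p_i(S)$ when offering $S = (S_1, S_2)$, the seller needs to select a subset of products from all available products to form the priority tier $S_1$ and the secondary tier $S_2$. We assume there are two pre-determined sets of product candidates, $X_1$ and $X_2$, one for each tier. We want to point out that the two candidate sets need not be mutually exclusive, and can completely overlap each other. A seller has the flexibility to assign products as candidates for the priority tier based on sales, trendiness, inventory, and other business criteria.
Denote the profit of product $i$ by $r_i$, its valuation by $v_i$, and the expected profit obtained from $S$ by $R(S, v)$. The expected profit can be expressed as
$$
R(S, v) = \sum_{i \in S_1} r_i \, \frac{v_i}{1 + \sum_{j \in S_1} v_j} + \frac{1}{1 + \sum_{j \in S_1} v_j} \sum_{i \in S_2} r_i \, \frac{v_i}{1 + \sum_{j \in S_2} v_j}.
$$
The seller’s optimization problem is to select two subsets of products $S_1$ and $S_2$ from the candidate sets $X_1$ and $X_2$ respectively. That is,
$$
\max_{S_1 \subseteq X_1,\; S_2 \subseteq X_2} R(S, v). \tag{3.1}
$$
We use $S^* = (S_1^*, S_2^*)$ to denote the optimal tiered product offering.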
4 Characteristics of the optimal tiered product offering
We begin this section with a simple example to compare a two-tiered product offering with its single-tiered counterpart.
**Example 1. ** Suppose there are two products with profit and valuation respectively. The optimal one-tier recommendation is to offer both products simultaneously and the corresponding expected profit is given by The optimal two-tier recommendation is to offer product 1 on the priority tier and product 2 on the secondary tier. The resulting profit .
This example shows that the tiered structure offers flexibility in presenting products, which translates into higher profit. Intuitively, the tiered recommendation prioritizes products with higher profits to be shown first. We can formalize this observation by analyzing the seller’s problem (3.1) in an offline setting where the product valuation is given.
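The comparison between one-tier and two-tier offerings can be checked numerically. The sketch below is a minimal illustration under the two-tier choice model of Section 3.1; the profits and valuations are hypothetical values picked for this illustration and are not the numbers used in Example 1.

```python
def expected_profit(tier1, tier2, r, v):
    """Expected profit of a two-tier offering; an empty tier2 gives the one-tier MNL case."""
    w1 = 1.0 + sum(v[i] for i in tier1)
    w2 = 1.0 + sum(v[i] for i in tier2)
    profit1 = sum(r[i] * v[i] for i in tier1) / w1
    # Tier 2 is reached only when nothing is selected from tier 1 (probability 1 / w1).
    profit2 = sum(r[i] * v[i] for i in tier2) / (w1 * w2)
    return profit1 + profit2


# Hypothetical profits and valuations for two products.
r = {1: 1.0, 2: 0.5}
v = {1: 0.5, 2: 0.5}

# All non-empty one-tier offerings.
one_tier = {
    "product 1 only": expected_profit([1], [], r, v),
    "product 2 only": expected_profit([2], [], r, v),
    "both products": expected_profit([1, 2], [], r, v),
}
best_one_tier = max(one_tier.values())

# Higher-profit product on the priority tier, the other on the secondary tier.
two_tier = expected_profit([1], [2], r, v)

print("best one-tier profit:", round(best_one_tier, 4))
print("two-tier profit:", round(two_tier, 4))
```

With these values, placing the higher-profit product on the priority tier and the other product on the secondary tier yields a higher expected profit than any single-tier offering.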
We now introduce two definitions which will help us characterize the properties of the optimal tiered product offering.
Definition 4.1 (Profit-ordered set)
We call is a profit-ordered set if for .
Definition 4.2 (Profit-ordered by tier)
*If there exist and such that , then is not profit-ordered by tier. Otherwise, it is profit-ordered by tier. *
Example 4.3
Suppose , , with profit for all , then the sets , are both profit-ordered by tier while the sets , are not.
Proposition 4.4
The optimal product offering to the optimization problem (3.1) in each tier is a profit-ordered set. In addition, is profit-ordered by tier.
Due to the space constraint, we only include proof sketches for the key results in the paper. All detailed proofs can be found in the supplementary material.
Proof sketch: We show is profit-ordered by contradiction. Suppose there exists a where and ; then we show that removing this product will increase the expected profit. Hence, is not optimal. A similar argument is used to show that if , and , then adding it to the offering will increase the profit. Next, we apply the same argument to to obtain the desired result.
To prove is profit-ordered by tier, notice that the expected profit of is at least as large as only offering since is also a feasible solution. Since we have shown that each tier in is a profit-ordered set, i.e., for any , , and for any , . Therefore, for any . This completes the proof.
Proposition 4.4 implies that a two-tier optimal recommendation can be characterized by a pair of profit thresholds with , where and for any and . Therefore, the seller’s optimization problem is polynomial-time solvable, as it follows directly from the fact that there are at most pairs of profit thresholds to enumerate through. In retail, as prices are discrete and often end with 9 or .99, there are far fewer unique price points than the number of products and the actual search space of profit thresholds is significantly smaller.
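The threshold characterization suggests a simple enumeration scheme for the offline problem. The sketch below is a minimal illustration, not the paper's algorithm: it assumes each tier of the optimal offering contains every candidate whose profit is at least some tier-specific threshold, and it checks the result against exhaustive search on a tiny hypothetical instance. All function names and numerical values are assumptions made for this illustration.

```python
from itertools import chain, combinations


def expected_profit(tier1, tier2, r, v):
    """Expected profit of a two-tier offering under the SMNL model."""
    w1 = 1.0 + sum(v[i] for i in tier1)
    w2 = 1.0 + sum(v[i] for i in tier2)
    profit1 = sum(r[i] * v[i] for i in tier1) / w1
    profit2 = sum(r[i] * v[i] for i in tier2) / (w1 * w2)
    return profit1 + profit2


def threshold_sets(candidates, r):
    """The empty set plus every set {i in candidates : r[i] >= threshold}."""
    sets = [[]]
    for threshold in sorted({r[i] for i in candidates}):
        sets.append([i for i in candidates if r[i] >= threshold])
    return sets


def solve_by_thresholds(x1, x2, r, v):
    """Search over pairs of profit thresholds, one threshold per tier."""
    best_value, best_s1, best_s2 = 0.0, [], []
    for s1 in threshold_sets(x1, r):
        for s2 in threshold_sets(x2, r):
            value = expected_profit(s1, s2, r, v)
            if value > best_value:
                best_value, best_s1, best_s2 = value, s1, s2
    return best_value, best_s1, best_s2


def solve_by_brute_force(x1, x2, r, v):
    """Exhaustive search over all subset pairs; only viable for tiny instances."""
    def subsets(items):
        return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))
    return max(expected_profit(s1, s2, r, v) for s1 in subsets(x1) for s2 in subsets(x2))


# Hypothetical instance with disjoint candidate sets.
profits = {1: 1.0, 2: 0.9, 3: 0.7, 4: 0.5, 5: 0.3, 6: 0.1}
valuations = {1: 0.3, 2: 0.5, 3: 0.4, 4: 0.6, 5: 0.9, 6: 0.8}
x1, x2 = [1, 3, 5], [2, 4, 6]

value, s1, s2 = solve_by_thresholds(x1, x2, profits, valuations)
brute_value = solve_by_brute_force(x1, x2, profits, valuations)
print("threshold search:", round(value, 4), s1, s2)
print("brute force:", round(brute_value, 4))
```

The threshold search evaluates a number of offerings that grows with the product of the two candidate-set sizes, whereas exhaustive search grows exponentially; on this instance both return the same optimal profit.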
The profit-ordered structure of the optimal tiered recommendation provides important insights regarding the placement of a new product. We will generalize the result to a setting with multiple tiers.
Proposition 4.5
Denote the optimal recommendation before and after including a new product with profit to a candidate set as and , respectively. Define . The following properties hold.
a.) for any .
b.) If for some , then but .
c.) If , then
Proposition 4.5 states that, for a two-tier product offering, unless a new product’s profit is higher than , where refers to what is currently being offered on the secondary tier, it will not be included. Therefore, this product will never be introduced or learned. As discussed in the introduction, many new products could have relatively low profit, but learning is crucial for providing insights to improve long-term profitability. This provides motivation for us to investigate an online learning task with a constraint to ensure all new products are learned to a given accuracy, which we will discuss in Section 6.
5 Learning product valuations
In the previous section, we have assumed that valuations of products are known. In practice, these quantities are not given to the seller and have to be learned.
5.1 Online setup
We consider a general setting where new products are introduced at different time stamps during a selling horizon . We allow several products to be launched at the same time. We use regret to measure the performance of a learning algorithm, where the regret for a policy is defined as,
[TABLE]
where is the optimal tiered product offering when is known, while is the tiered recommendation offered to the customer arriving at time . denotes the profit accrued at time when offering recommendation .
For our learning task, we extend the framework in Agrawal et al. (2017a), which proposed a UCB-based algorithm for an online learning task with an MNL model. We want to emphasize that the tiered structure in the SMNL model significantly complicates the analysis as the decisions across the tiers are interdependent. Next, we will describe a counting process to derive an unbiased estimator of for .
5.2 Unbiased estimator on product valuation
We divide the time horizon into epochs for the priority and the secondary tier respectively, i.e., and . Let . In each epoch for , we offer the same product selection for tier until a no-purchase in occurs. An epoch is labeled as if and only if epochs have been completed before . Let contain all time steps during epoch when is shown to a customer.
Example 5.1
Figure 1 illustrates the counting process with an example, which shows the purchase decisions of 9 customers, i.e., . The first customer selects a product from the priority tier, and the second customer selects a product from the secondary tier, and so on. The table in Figure 1 shows how epochs are labeled for different tiers. Here we have and . For , the epoch count at time is the same as the total number of no-purchases from both tiers before time . Thus, when , epoch since there is a total of 3 no-purchases across both tiers by . Note that for the secondary tier , we only keep track of the time steps and the epoch count when is shown to a customer (i.e., the customer does not purchase any product from ). In terms of the time steps for each epoch, we have , , , , , and .
For any time step , we use to denote the purchase decision of customer on tier , i.e., if the consumer purchased product , and 0 for a no-purchase. For any product and , define and as the number of times a product is purchased in epoch as part of the primary or secondary tier selections respectively.
Let be the set of epochs which contain product in tier offering before epoch . Define , which denotes the number of epochs which contain in tier offering before epoch . Let , as the total number of epochs which contain in the tiered recommendation before epoch . We compute as the average number of times product is purchased per epoch, i.e.,
[TABLE]
Lemma 5.2
* are i.i.d. geometric random variables with parameter for any and . Therefore, they are unbiased i.i.d. estimators of .*
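The epoch-based estimator can be illustrated with a small simulation. The sketch below is a minimal illustration restricted to the priority tier: it repeatedly offers a fixed product set until a customer selects nothing from it, counts purchases per product within each epoch, and averages the counts over epochs. The function names, the valuations, and the number of epochs are hypothetical choices for this illustration.

```python
import random


def purchases_in_one_epoch(tier1, v, rng):
    """Offer `tier1` until a customer buys nothing from it; return purchase counts."""
    products = list(tier1)
    # MNL choice within the tier: weight v[i] per product, weight 1 for no-purchase.
    options = products + [None]
    weights = [v[i] for i in products] + [1.0]
    counts = {i: 0 for i in products}
    while True:
        choice = rng.choices(options, weights=weights)[0]
        if choice is None:  # no-purchase ends the epoch
            return counts
        counts[choice] += 1


def estimate_valuations(tier1, v, num_epochs, seed=0):
    """Average number of purchases per epoch for each product."""
    rng = random.Random(seed)
    totals = {i: 0 for i in tier1}
    for _ in range(num_epochs):
        for i, count in purchases_in_one_epoch(tier1, v, rng).items():
            totals[i] += count
    return {i: totals[i] / num_epochs for i in tier1}


# Hypothetical true valuations.
true_v = {"a": 0.4, "b": 0.2}
estimates = estimate_valuations(["a", "b"], true_v, num_epochs=20000)
print(estimates)
```

The per-epoch averages land close to the true valuations, consistent with the per-epoch purchase counts being unbiased estimators of the valuations.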
5.3 Learning algorithm for SMNL bandit
Define the upper confidence bound on as follows,
[TABLE]
where is the initial launch epoch of product , is defined in Equation (5.1), and is the total number of products.
We briefly describe our UCB-based algorithm: In each epoch , we use to compute the optimal product offering . Denote as the optimal product set when the value of products is and is the optimal set selected from the entire candidate sets including the new product. To bound the profit difference between and , we derive the following result.
Lemma 5.3
Assume for all . Suppose is an optimal tiered recommendation when the parameters of SMNL model are given by . Then .
Lemma 5.3 is a key step in the regret analysis for this UCB-based algorithm. With Lemma 5.3, on the “large probability” event that for all , we can bound the difference by . We will expand the regret analysis in more detail in the next section, where we impose an additional constraint on our learning task, as the current setting is a special case when the constraint is absent.
6 Regret analysis with the minimum learning criterion
As we have discussed in Section 4, by default a new product will only be included in the product offering if its profit , where is the current product offering at the secondary tier. In other words, new products with profit will never be offered and are deprived of the learning opportunity. To have a more realistic setting, we will formally define a minimum learning constraint. We will then investigate a learning algorithm and quantify its resulting regret, starting with a single new product and later generalizing to multiple new products.
6.1 Minimum learning criterion
We impose a constraint in our learning task to ensure that every product will be offered for at least a number of times to allow us to learn its valuation to a certain accuracy. More specifically, we require the estimated valuation of every new product to be within to the true with a probability which is at least , where and are two pre-determined parameters. We derive the following lemma which specifies the number of epochs needed to achieve a given level of estimation accuracy.
Lemma 6.1 (Minimum learning criterion)
For any and , if the number of epochs , then is within the confidence bound of with probability at least . That is, if .
We want to emphasize that the constraint only affects a subset of new products which are otherwise excluded from being offered due to their relatively low profitability. Once they are offered and samples have been collected, they will be dropped from future product recommendations. On the other hand, new products (along with some existing products) with relatively high profit will continue to be offered after epochs and the estimates of their valuations will be further improved. This echoes what typically happens after product launches, where companies choose to continue or stop certain new products based on market response.
6.2 Learning with
In this section, we focus on a setting where a single new product with low profit is launched in the middle of a selling horizon. Part of our goal is to determine the best way to include this product into learning.
By Proposition 4.5, this low-profit product will be excluded from learning by default. In order to satisfy the minimum learning criterion, this new product will have to be offered for epochs, where is determined by Lemma 6.1. There are two possible strategies for us to learn this new product, i.e., either assigning it to the priority tier or the secondary tier.
Which strategy is better is not immediately clear. While the duration of an epoch is shorter when a product is placed on the priority tier, it could also mean that more of this product will be purchased, leading to more profit loss and higher regret. On the other hand, even though a product placed on the secondary tier might make fewer sales, the duration of a single epoch could be much longer and the resulting regret could still be high since other products (in addition to the new product) also contribute to the total regret. We now formally compare the two strategies by quantifying the corresponding regrets incurred during a single epoch.
Strategy 1: Assigning new product to the priority tier
Let , , and . Let denote the number of times has been shown to customers until a no-purchase occurs. Note that follows the geometric distribution with mean , which depends on the valuation of all products in .
Define the regret function during one epoch when the new product is included in the first tier as , i.e.,
[TABLE]
where and .
Strategy 2: Assigning new product to the secondary tier
Let , , and . denotes the number of times has been shown to customers until a no-purchase from the entire product offering (i.e., both tiers). follows the geometric distribution with mean .
Similarly, we define the corresponding regret function as follows,
[TABLE]
where and .
To compare the two strategies, we first need to determine the optimal action under a given strategy, then evaluate its “best” loss. The strategy which yields the lower regret is then considered a “better” strategy. Let and denote the optimal solution that minimizes the regret and , respectively, i.e., and .
Theorem 6.2
*The optimal solution to and is the same as . That is, In addition, we have *
The implication of Theorem 6.2 is twofold. Firstly, it shows that the optimal offerings excluding the new product are identical for both strategies, irrespective of which tier the new product has been added to. In addition, they are also the same as the optimal offering before the new product is added. In other words, there is no need to re-solve the optimization problem with the added new product. Thus, it provides a simple learning algorithm for a new product with : It is optimal to just add it to the secondary tier of the existing optimal product offering to satisfy the learning criterion.
Secondly, Theorem 6.2 also shows that with this optimal product offering , the regret is lower when the new product is added to the secondary tier. This result highlights the advantage of showcasing product recommendations in multiple tiers, in the sense that we incur a smaller loss by displaying new products with higher risks (i.e., lower profit) on tiers with lower priorities.
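The comparison between the two placement strategies can be illustrated numerically. The sketch below is a rough illustration under an explicit assumption: the regret incurred during one epoch is taken to be the expected epoch length multiplied by the per-customer expected profit loss relative to the offering without the new product. This is one reading of "regret during one epoch" and is not claimed to match the paper's regret functions exactly. The function names and all numerical values are hypothetical.

```python
def expected_profit(tier1, tier2, r, v):
    """Expected profit of a two-tier offering under the SMNL model."""
    w1 = 1.0 + sum(v[i] for i in tier1)
    w2 = 1.0 + sum(v[i] for i in tier2)
    profit1 = sum(r[i] * v[i] for i in tier1) / w1
    profit2 = sum(r[i] * v[i] for i in tier2) / (w1 * w2)
    return profit1 + profit2


def per_epoch_regret(tier1, tier2, new_product, tier, r, v):
    """Assumed per-epoch regret: expected epoch length times per-customer profit loss."""
    baseline = expected_profit(tier1, tier2, r, v)
    w1 = 1.0 + sum(v[i] for i in tier1)
    w2 = 1.0 + sum(v[i] for i in tier2)
    if tier == 1:
        offered = expected_profit(tier1 + [new_product], tier2, r, v)
        # The epoch ends at the first no-purchase in the priority tier.
        expected_length = w1 + v[new_product]
    else:
        offered = expected_profit(tier1, tier2 + [new_product], r, v)
        # The epoch ends at the first no-purchase across both tiers.
        expected_length = w1 * (w2 + v[new_product])
    return expected_length * (baseline - offered)


# Hypothetical existing offering (product 1 on tier 1, product 2 on tier 2)
# and a hypothetical low-profit new product labelled "new".
r = {1: 1.0, 2: 0.5, "new": 0.1}
v = {1: 0.5, 2: 0.5, "new": 0.3}

regret_priority = per_epoch_regret([1], [2], "new", 1, r, v)
regret_secondary = per_epoch_regret([1], [2], "new", 2, r, v)
print("new product on priority tier:", round(regret_priority, 4))
print("new product on secondary tier:", round(regret_secondary, 4))
```

In this instance the epoch is longer when the new product sits on the secondary tier, yet the per-customer loss is small enough that the per-epoch regret is still lower, in line with the direction of Theorem 6.2.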
6.3 Learning with multiple new products
This section focuses on a general setting similar to the one addressed in Section 5.1, except with the minimum learning constraint in place.
We propose Algorithm 1 to dynamically offer a recommendation which simultaneously explores and exploits. In Algorithm 1, for each epoch , we compute the optimal tiered recommendation given valuation . Based on Theorem 6.2, for any new product which is not included in , we add it to the second tier . At the end of each epoch, we update and , which will be used to compute the recommendation for the next epoch.
We are now ready to present an upper bound on the regret for Algorithm 1. We provide a proof sketch here and the detailed proof can be found in the Supplementary Material.
Theorem 6.3 (Performance bound for Algorithm 1)
The regret during time is bounded above by
[TABLE]
for some constant , where is the highest profit of products among , and is the total number of products.
Proof sketch: We first rewrite the regret in terms of the epochs. Note that one learning epoch on the secondary tier may correspond to multiple learning epochs on the priority tier. Let denote the set of epochs on tier 1 which correspond to epoch . In Example 5.1 as shown in Figure 1, we have , . Thus, the regret until time can be expressed as , where the set denotes the set of new products with low profit which are added to the second tier at epoch to satisfy the minimum learning criterion.
Define the “large probability” event Meanwhile, by Lemma 5.3, we have . Thus, conditional on the event and Lemma 5.3, we can show that can be bounded above by .
We see that the regret consists of two parts: The first term can be bounded above by
The second term can also be bounded since each product will be included in the set for at most times.
Combining this result on the “large probability” event with the error incurred on the “small probability” event , we obtain the upper bound on the regret.
We want to point out that Algorithm 1 can be easily extended to include more than two tiers, and Theorem 6.3 will continue to hold. The regret bound in Theorem 6.3 consists of three terms, where the first two terms account for the estimation error on product valuation, while the third term is linear in , representing the price one has to pay in order to include new products with low profit into learning. When , Theorem 6.3 provides the regret bound for the case without the minimum learning criterion, which is a special case discussed in Section 5.3.
7 Numerical experiments
In this section, we conduct three experiments. We first investigate the robustness of Algorithm 1. Next, we compare Algorithm 1 that simultaneously explores and exploits with a benchmark algorithm which separates the two phases. Lastly, we compare our algorithm with an alternative strategy for learning new products.
Experiment 1 (Robustness study)
We consider a setting where contains 80 products with profit uniformly distributed on [0,1] and 20 products with profit uniformly distributed on [0,0.2]. We compare four scenarios, in which the product valuation is uniformly distributed on [0,0.1], [0,0.2], [0,0.3], and [0,0.5]. A new product is introduced after every 800 time steps. We set for the minimum learning criterion.
Figure 2 shows the results based on 10 independent simulations for different distributions of . The average regrets are 129.87, 243.38, 348.31, and 620.14 for the four scenarios. Notice that both the mean and variance of the regret are increasing with the support of . It implies that the learning process is harder when the product valuations lie on a larger support and have higher variability.
Experiment 2 (Comparison with an explore-then-exploit benchmark)
The benchmark we consider is adapted from Sauré and Zeevi (2013). As shown in Section 4, there are at most candidates which are profit-ordered by tier. In the exploration phase of the benchmark algorithm, every candidate whose profit is higher than the current optimum is offered at least times, where is a tuning parameter. In the exploitation phase, the algorithm uses the estimated parameters to determine a tiered offering with the highest expected profit and offers it to all customers.
For the experiment, consider the setting that contains 12 products, where the profits of 8 of them are uniformly distributed on [0,1], and those of the other 4 products on [0,0.2]. The valuation is uniformly distributed on [0,0.1]. For ease of comparison, all products are launched at . Set .
Figure 3 shows the results based on 10 independent simulations. Our algorithm clearly outperforms the benchmark: the average regrets are 14.39 and 247.78 under Algorithm 1 and the benchmark respectively.
Experiment 3 (Comparison with an alternative learning strategy for new products)
We have shown in Algorithm 1 that new products with profit lower than will be added to the secondary tier. In this experiment, we compare it with an alternative strategy where those new products with low profit will be randomly added to either tier with equal probability for learning. To be precise, we consider a setting where contains 20 products with profit uniformly distributed on [0.5,1] and valuation on [0,0.1]. contains 30 products with profit uniformly distributed on [0,0.6] and valuation on [0,0.2]. We compute the optimal product offering as the current offering based on these values. Next, we assume 15 new products with profit uniformly distributed on [0,0.55] and valuation on [0,0.3] are launched at time . For the benchmark, new products with profit below will be randomly added to one of the tiers. Set .
As shown in Figure 4, the average regrets are 102.21 under Algorithm 1 and 178.00 under the alternative strategy. It highlights the benefit of having a tiered offering as one could use the secondary tier to mitigate some profit risk when learning with new products.
8 Conclusion
In this work, we studied a product selection problem with an SMNL model which specifies the order in which products are presented. For the offline setting where the product valuations are known, we provided a polynomial-time algorithm. For the online setting, we analyzed a novel setup where multiple new products could arrive in the middle of a selling period. We proposed an online algorithm and characterized its regret both with and without the minimum learning criterion.
There are several future directions of this work. For instance, products’ valuations may vary with time, especially for fashion and technology products. Thus, there is a need for an online algorithm that learns the dynamic valuations. In addition, it would be interesting to utilize customer attribute data and historical sales data to provide personalized recommendations.
References
- Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017a). MNL-bandit: A dynamic learning approach to assortment selection. arXiv preprint arXiv:1706.03880.
- Agrawal, S., Avadhanula, V., Goyal, V., and Zeevi, A. (2017b). Thompson sampling for the MNL-bandit. arXiv preprint arXiv:1706.00977.
- Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
- Chen, W., Wang, Y., and Yuan, Y. (2013). Combinatorial multi-armed bandit: General framework and applications. In International Conference on Machine Learning, pages 151–159.
- Cheung, W. C. and Simchi-Levi, D. (2017). Thompson sampling for online personalized assortment optimization problems with multinomial logit choice models.
- Flores, A., Berbeglia, G., and Van Hentenryck, P. (2018). Assortment optimization under the sequential multinomial logit model. European Journal of Operational Research.
- Kök, A. G., Fisher, M. L., and Vaidyanathan, R. (2008). Assortment planning: Review of literature and industry practice. In Retail Supply Chain Management, pages 99–153. Springer.
- Robbins, H. (1985). Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer.
