On Pareto Optimality for Parametric Choice Bandits
Jierui Zuo, Hanzhang Qin

TL;DR
This paper develops a theoretical framework for online assortment optimization under stochastic choice, balancing revenue performance and inference quality, with explicit regret and error bounds for specific choice models.
Contribution
It introduces a unified OFU-based scheme with regularized likelihood estimators, deriving explicit regret and inference bounds for MNL and other models, and characterizes Pareto-optimal exploration rates.
Findings
Regret bound of $\tilde{O}(n_T + T/\sqrt{n_T})$ for MNL.
Revenue-contrast error of $\tilde{O}(1/\sqrt{n_T})$ for MNL.
Pareto-optimal exploration rates $n_T$ range from $T^{2/3}$ to $T$, trading off regret against inference error.
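The arithmetic behind the exploration range can be checked with a short sketch. It assumes that n_T denotes the number of forced-exploration rounds and is set to T^gamma, and it compares only the exponents of T, ignoring logarithmic factors and constants. The helper names `rate_exponents` and `is_pareto_optimal` are hypothetical and are not taken from the paper, whose own Pareto-optimality notion may be stated differently.

```python
from fractions import Fraction

def rate_exponents(gamma):
    """Exponents of T when n_T = T**gamma (log factors and constants ignored).

    Regret n_T + T/sqrt(n_T) scales as T**max(gamma, 1 - gamma/2).
    Revenue-contrast error 1/sqrt(n_T) scales as T**(-gamma/2).
    """
    regret_exp = max(gamma, 1 - gamma / 2)
    error_exp = -gamma / 2
    return regret_exp, error_exp

def is_pareto_optimal(gamma, grid):
    """True if no other gamma in the grid is at least as good on both
    exponents and strictly better on one (smaller exponent is better)."""
    r, e = rate_exponents(gamma)
    for g in grid:
        r2, e2 = rate_exponents(g)
        if r2 <= r and e2 <= e and (r2 < r or e2 < e):
            return False
    return True

# Exact rational grid of exponents gamma in [0, 1], in steps of 1/12.
grid = [Fraction(i, 12) for i in range(13)]
frontier = [g for g in grid if is_pareto_optimal(g, grid)]

for g in frontier:
    r, e = rate_exponents(g)
    print(f"gamma={g}: regret ~ T^{r}, error ~ T^{e}")
```

On this grid, every gamma below 2/3 is dominated by gamma = 2/3, which minimizes the regret exponent by balancing n_T against T/sqrt(n_T). Above 2/3, a larger gamma buys a smaller inference error at the cost of more regret, so no choice dominates another.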
Abstract
We study online assortment optimization under stochastic choice when a decision maker simultaneously values cumulative revenue performance and the quality of post-hoc inference on revenue contrasts. We analyze a forced-exploration optimism-in-the-face-of-uncertainty (OFU) scheme that combines two regularized maximum-likelihood estimators: one based on all observations for sequential decision making, and one based only on exploration rounds for inference. Our general theory is developed under predictable score proxies and per-round action-dependent curvature domination. Under these conditions we establish a self-normalized concentration inequality, a likelihood-based ellipsoidal confidence-set theorem, and a regret bound for approximate optimistic actions that explicitly accounts for optimization error. For the multinomial logit (MNL) model we derive explicit score and curvature proxies…
