Disentangling Exploration from Exploitation
Alessandro Lizzeri, Eran Shmaya, Leeat Yariv

TL;DR
This paper analyzes the optimal experimentation policy in Poisson bandits when exploration and exploitation are separated. It shows that the optimal policy achieves complete learning asymptotically, is highly persistent, and cannot be characterized by a Gittins-style index.
Contribution
It characterizes the optimal policy when exploration and exploitation are disentangled in Poisson bandits with general news structures, extending the analysis beyond the traditional setting in which the two are intertwined.
Findings
Optimal policy achieves complete learning asymptotically
Optimal policy is highly persistent but cannot be identified by a Gittins-style index
Disentanglement is especially beneficial for intermediate parameters
Abstract
Starting from Robbins (1952), the literature on experimentation via multi-armed bandits has wed exploration and exploitation. Nonetheless, in many applications, agents' exploration and exploitation need not be intertwined: a policymaker may assess new policies different than the status quo; an investor may evaluate projects outside her portfolio. We characterize the optimal experimentation policy when exploration and exploitation are disentangled in the case of Poisson bandits, allowing for general news structures. The optimal policy features complete learning asymptotically, exhibits lots of persistence, but cannot be identified by an index à la Gittins. Disentanglement is particularly valuable for intermediate parameter values.
