Exploring Offline Policy Evaluation for the Continuous-Armed Bandit Problem
Jules Kruijswijk, Petri Parvinen, Maurits Kaptein

TL;DR
This paper extends an offline policy evaluation method to the continuous-armed bandit problem, enabling reliable evaluation and selection of policies using empirical data without costly field trials.
Contribution
We adapt and empirically validate an offline evaluation method for continuous-armed bandit policies, broadening its applicability beyond discrete actions.
Findings
The extended method provides a relatively consistent ranking of policies in the empirical evaluation.
It can be used effectively for policy selection in real-world CAB problems.
The approach relaxes assumptions compared to simulation-based evaluation.
Abstract
The (contextual) multi-armed bandit (MAB) problem provides a formalization of sequential decision-making that has many applications. However, validly evaluating MAB policies is challenging: we either resort to simulations, which inherently include debatable assumptions, or to expensive field trials. Recently, an offline evaluation method has been suggested that is based on empirical data, thus relaxing these assumptions, and that can be used to evaluate multiple competing policies in parallel. This method is, however, not directly suited to the continuous-armed bandit (CAB) problem, an often-encountered version of the MAB problem in which the action set is continuous instead of discrete. We propose and evaluate an extension of the existing method such that it can be used to evaluate CAB policies. We empirically demonstrate that our method provides a relatively consistent ranking of policies.…
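The abstract does not spell out the mechanics of the extension, so the following is only a minimal sketch of one natural way to replay logged data against a continuous-action policy, not the authors' exact procedure. It assumes the logged actions were drawn uniformly at random, and it accepts a logged event when the policy's proposed action falls within a tolerance band around the logged action. The names `replay_evaluate`, `FixedActionPolicy`, and `tolerance` are hypothetical and introduced purely for illustration.

```python
def replay_evaluate(logged_events, policy, tolerance):
    """Replay-style offline evaluation with a tolerance band on the action.

    logged_events: iterable of (logged_action, reward) pairs, assumed here to
        come from a logging policy that picked actions uniformly at random.
    policy: hypothetical object exposing choose() -> float and
        update(action, reward).
    tolerance: half-width of the acceptance band (an assumption of this
        sketch, not a value taken from the paper).
    Returns (mean reward over accepted events, number of accepted events).
    """
    total_reward = 0.0
    accepted = 0
    for logged_action, reward in logged_events:
        proposed = policy.choose()
        # Keep the event only if the proposal is close to the logged action;
        # otherwise the logged reward says nothing about this proposal.
        if abs(proposed - logged_action) <= tolerance:
            policy.update(logged_action, reward)
            total_reward += reward
            accepted += 1
    mean_reward = total_reward / accepted if accepted else float("nan")
    return mean_reward, accepted


class FixedActionPolicy:
    """Toy policy that always proposes the same action (illustration only)."""

    def __init__(self, action):
        self.action = action

    def choose(self):
        return self.action

    def update(self, action, reward):
        pass  # a learning policy would update its estimates here
```

Because the evaluation loop only reads the logged events, several candidate policies can be passed through the same log and ranked by their mean accepted reward. In a sketch like this, a wider tolerance accepts more events but compares each proposal with less similar logged actions, so the choice of band trades variance against bias.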
Taxonomy
Topics: Advanced Bandit Algorithms Research · Machine Learning and Algorithms · Reinforcement Learning in Robotics
