Adaptive Exploration for Multi-Reward Multi-Policy Evaluation
Alessio Russo, Aldo Pacchiano

TL;DR
This paper introduces an adaptive exploration method for efficiently evaluating multiple policies across various reward functions in an online setting, achieving high-confidence estimates with reduced sample complexity.
Contribution
It extends multi-policy evaluation to a multi-reward setting under an (ε, δ)-PAC framework and proposes an efficient, instance-specific exploration strategy based on a convex approximation.
Findings
Effective adaptive exploration reduces sample complexity.
The method achieves high-confidence policy evaluation across reward sets.
Experimental results validate the approach in tabular domains.
Abstract
We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an (ε, δ)-PAC perspective to achieve ε-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex…
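To make the evaluation target concrete, here is a minimal sketch of what "evaluating several policies under several rewards" means in a small tabular discounted MDP. It assumes the transition model is known and uses standard iterative policy evaluation, so it only computes the reference values that an online method would try to estimate; it is not the MR-NaS exploration scheme or the paper's lower bound. The function names, the toy MDP, and the sup-norm error check are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch only: known-model policy evaluation for every
# (policy, reward) pair. All names and the toy MDP are hypothetical.

def evaluate(P, pi, r, gamma, tol=1e-10, max_iter=100_000):
    """Iterative policy evaluation.

    P[s][a][t] : probability of moving from state s to t under action a
    pi[s][a]   : probability that the policy picks action a in state s
    r[s][a]    : reward for taking action a in state s
    """
    n = len(P)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [
            sum(
                pi[s][a] * (r[s][a] + gamma * sum(P[s][a][t] * v[t] for t in range(n)))
                for a in range(len(pi[s]))
            )
            for s in range(n)
        ]
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


def evaluate_all(P, policies, rewards, gamma):
    """Value vector for every (policy index, reward index) pair."""
    return {
        (i, j): evaluate(P, pi, r, gamma)
        for i, pi in enumerate(policies)
        for j, r in enumerate(rewards)
    }


def max_error(estimates, reference):
    """Largest sup-norm gap over all pairs (one possible accuracy measure, assumed here)."""
    return max(
        abs(x - y)
        for key in reference
        for x, y in zip(estimates[key], reference[key])
    )


# Toy MDP: two states; action 0 stays in place, action 1 switches state.
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
stay = [[1.0, 0.0], [1.0, 0.0]]
switch = [[0.0, 1.0], [0.0, 1.0]]
reward_in_state_0 = [[1.0, 1.0], [0.0, 0.0]]
reward_in_state_1 = [[0.0, 0.0], [1.0, 1.0]]

values = evaluate_all(P, [stay, switch], [reward_in_state_0, reward_in_state_1], gamma=0.9)
```

In this toy example, `values[(0, 0)]` is the value of the "stay" policy under the first reward. An online estimator could be compared against these reference values with `max_error` and a chosen accuracy threshold.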
Taxonomy
Topics: Economic Policies and Impacts · Software Reliability and Analysis Research
Methods: ADaptive gradient method with the OPTimal convergence rate
