Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

Alessio Russo; Aldo Pacchiano

arXiv:2502.02516 · cs.LG · August 19, 2025

TL;DR

This paper introduces an adaptive exploration method for efficiently evaluating multiple policies across various reward functions in an online setting, achieving high-confidence estimates with reduced sample complexity.

Contribution

It extends multi-policy evaluation to a multi-reward setting under an $(ϵ, δ)$-PAC framework and proposes an efficient, instance-specific exploration strategy based on a convex approximation.

Findings

01 Adaptive exploration reduces sample complexity.

02 The method achieves high-confidence policy evaluation across reward sets.

03 Experimental results validate the approach in tabular domains.

Abstract

We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(ϵ, δ)$-PAC perspective to achieve $ϵ$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex…
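As a rough illustration of the setting described in the abstract, the sketch below computes exact discounted values of one policy under several reward functions in a small tabular MDP, then checks an $ϵ$-accuracy condition in sup norm. The function names, the toy MDP, and the exact form of the accuracy check are assumptions made for illustration. This is not the MR-NaS exploration scheme, and an $(ϵ, δ)$-PAC guarantee would additionally require the check to hold with probability at least $1 - δ$ over the collected data.

```python
import numpy as np


def evaluate_policy(P, pi, rewards, gamma):
    """Exact discounted state values of one policy under several rewards.

    P:       (S, A, S) transition tensor, P[s, a, t] = Pr(t | s, a)
    pi:      (S, A) action probabilities of the policy
    rewards: (R, S, A) stack of R reward functions
    gamma:   discount factor in [0, 1)
    Returns an (R, S) array, one value vector per reward function.
    """
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)        # state-to-state kernel under pi
    r_pi = np.einsum("sa,rsa->rs", pi, rewards)  # expected one-step reward, per reward
    # Solve (I - gamma * P_pi) V = r_pi for all reward functions at once.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi.T).T


def is_eps_accurate(V_true, V_hat, eps):
    """Hypothetical check: worst-case error over rewards and states is <= eps."""
    return bool(np.max(np.abs(V_true - V_hat)) <= eps)


# Toy MDP (assumed for illustration): 2 states, 2 actions, 2 reward functions.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0 under actions 0, 1
    [[0.5, 0.5], [0.0, 1.0]],   # transitions from state 1 under actions 0, 1
])
pi = np.array([[0.5, 0.5], [1.0, 0.0]])
rewards = np.stack([np.ones((2, 2)), np.zeros((2, 2))])
gamma = 0.9

V = evaluate_policy(P, pi, rewards, gamma)
```

With a constant reward of 1, every state value equals $1/(1-γ)$, which gives a quick sanity check on the solver. In the online setting of the paper the transition model is unknown, so `V` would be replaced by an estimate built from sampled transitions.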

Taxonomy

Topics: Economic Policies and Impacts · Software Reliability and Analysis Research

Methods: ADaptive gradient method with the OPTimal convergence rate