Exploration from a Primal-Dual Lens: Value-Incentivized Actor-Critic Methods for Sample-Efficient Online RL

Tong Yang; Bo Dai; Lin Xiao; Yuejie Chi

arXiv:2506.22401·cs.LG·June 30, 2025

Open Access

TL;DR

This paper introduces a primal-dual perspective on exploration in online reinforcement learning, proposing a value-incentivized actor-critic method with theoretical guarantees for sample efficiency.

Contribution

The paper presents a value-incentivized actor-critic (VAC) algorithm, derived from primal-dual optimization, that combines exploration and exploitation in a single objective and comes with theoretical performance bounds.

Findings

01. Achieves near-optimal regret in linear MDPs.

02. Provides a unified framework for exploration via primal-dual interpretation.

03. Extensible to general function approximation under certain conditions.

Abstract

Online reinforcement learning (RL) with complex function approximations such as transformers and deep neural networks plays a significant role in the modern practice of artificial intelligence. Despite its popularity and importance, balancing the fundamental trade-off between exploration and exploitation remains a long-standing challenge; in particular, we still lack efficient and practical schemes that are backed by theoretical performance guarantees. Motivated by recent developments in exploration via optimistic regularization, this paper provides an interpretation of the principle of optimism through the lens of primal-dual optimization. From this fresh perspective, we set forth a new value-incentivized actor-critic (VAC) method, which optimizes a single easy-to-optimize objective integrating exploration and exploitation -- it promotes state-action and policy estimates that…

Taxonomy

Topics: Reinforcement Learning in Robotics · Advanced Bandit Algorithms Research · Adaptive Dynamic Programming Control