Coordinate Ascent for Off-Policy RL with Global Convergence Guarantees
Hsin-En Su, Yen-Ju Chen, Ping-Chun Hsieh, Xi Liu

TL;DR
This paper introduces CAPO, a coordinate ascent-based off-policy RL algorithm that guarantees global convergence without requiring distribution correction, and demonstrates its effectiveness with neural policies.
Contribution
The paper proposes CAPO, an off-policy actor-critic method that avoids distribution mismatch issues and provides theoretical convergence guarantees.
Findings
CAPO converges globally under general coordinate selection.
CAPO achieves competitive performance in experiments.
CAPO is extended to neural policies for practical use.
Abstract
We revisit the domain of off-policy policy optimization in RL from the perspective of coordinate ascent. One commonly used approach is to leverage the off-policy policy gradient to optimize a surrogate objective: the expected total discounted return of the target policy with respect to the state distribution of the behavior policy. However, this approach has been shown to suffer from the distribution mismatch issue, and significant effort is therefore needed to correct this mismatch, either via state distribution correction or a counterfactual method. In this paper, we rethink off-policy learning via Coordinate Ascent Policy Optimization (CAPO), an off-policy actor-critic algorithm that decouples policy improvement from the state distribution of the behavior policy without using the policy gradient. This design obviates the need for distribution correction or importance…
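The coordinate-ascent idea in the abstract can be illustrated with a minimal tabular sketch. It is not the authors' implementation: the update rule here (move one softmax logit up or down by a fixed step according to the sign of its advantage) is an assumption made for illustration, since this summary does not state the exact rule. The toy MDP, the step size `alpha`, and all function names are likewise hypothetical. The point it shows is that the state-action coordinates can come from any behavior source, and no importance weights appear in the update.

```python
import math
import random

# Hypothetical 2-state, 2-action deterministic MDP (not from the paper).
# P[s][a] is the next state, R[s][a] is the reward.
P = [[0, 1], [0, 1]]
R = [[0.0, 0.0], [0.0, 1.0]]
GAMMA = 0.9


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def evaluate(theta, sweeps=500):
    """Iterative policy evaluation of the softmax policy; returns (V, Q)."""
    n_s, n_a = len(P), len(P[0])
    pi = [softmax(row) for row in theta]
    V = [0.0] * n_s
    for _ in range(sweeps):
        V = [
            sum(pi[s][a] * (R[s][a] + GAMMA * V[P[s][a]]) for a in range(n_a))
            for s in range(n_s)
        ]
    Q = [[R[s][a] + GAMMA * V[P[s][a]] for a in range(n_a)] for s in range(n_s)]
    return V, Q


def coordinate_step(theta, batch, alpha=0.5):
    """Update only the logits theta[s][a] for the (s, a) pairs in the batch.

    Assumed rule: step by +alpha or -alpha following the advantage sign.
    """
    V, Q = evaluate(theta)
    for s, a in batch:
        adv = Q[s][a] - V[s]
        if adv > 0:
            theta[s][a] += alpha
        elif adv < 0:
            theta[s][a] -= alpha
    return theta


def train(iters=40, seed=0):
    """Run coordinate updates on coordinates drawn uniformly at random."""
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iters):
        # The coordinate is picked by a uniform "behavior" sampler that is
        # unrelated to the current target policy.
        batch = [(rng.randrange(2), rng.randrange(2))]
        coordinate_step(theta, batch)
    return theta


if __name__ == "__main__":
    theta = train()
    for s, row in enumerate(theta):
        print(s, [round(p, 3) for p in softmax(row)])
```

In this toy problem action 1 is optimal in both states, so the learned policy should shift probability toward it regardless of how the coordinates were sampled. A critic learned from data would replace the exact `evaluate` call in an actor-critic setting.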
Taxonomy
Topics: Reinforcement Learning in Robotics
