Policy Newton Algorithm in Reproducing Kernel Hilbert Space

Yixian Zhang; Huaze Tang; Chao Wang; Wenbo Ding

arXiv:2506.01597·cs.LG·June 3, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Policy Newton in RKHS, a novel second-order optimization method for RL policies in Reproducing Kernel Hilbert Spaces, enabling faster convergence and better performance than existing first-order methods.

Contribution

It develops the first second-order optimization framework for RKHS-based RL policies, transforming an infinite-dimensional problem into a finite-dimensional one with theoretical guarantees.
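A minimal sketch of what such a finite-dimensional representation looks like in general, assuming a Gaussian kernel and a function of the form h(x) = sum_i alpha_i k(c_i, x). The names `gaussian_kernel`, `evaluate`, and `rkhs_norm_sq`, as well as the centers and coefficients, are illustrative assumptions and are not taken from the paper; the sketch only shows that a function in the span of finitely many kernel sections is fully described by a coefficient vector.

```python
import math

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel k(x, y) between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def evaluate(alpha, centers, x, bandwidth=1.0):
    """Evaluate h(x) = sum_i alpha_i * k(centers_i, x)."""
    return sum(a * gaussian_kernel(c, x, bandwidth) for a, c in zip(alpha, centers))

def rkhs_norm_sq(alpha, centers, bandwidth=1.0):
    """Squared RKHS norm ||h||^2 = alpha^T K alpha, with K the Gram matrix."""
    return sum(
        ai * aj * gaussian_kernel(ci, cj, bandwidth)
        for ai, ci in zip(alpha, centers)
        for aj, cj in zip(alpha, centers)
    )

# Hypothetical centers and coefficients, chosen only for illustration.
centers = [(0.0,), (1.0,), (2.0,)]
alpha = [0.5, -1.0, 0.25]

value_at_center = evaluate(alpha, centers, (1.0,))
norm_sq = rkhs_norm_sq(alpha, centers)
```

Under this representation, any optimization over h restricted to the span of the kernel sections becomes an optimization over the vector `alpha`, which is the kind of reduction the contribution statement describes.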

Findings

1. Achieves quadratic convergence to a local optimum.
2. Demonstrates superior convergence speed over first-order methods.
3. Attains higher episodic rewards on benchmark tasks.
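To make the term "quadratic convergence" in the findings above concrete, here is a small sketch of plain Newton's method on a toy one-dimensional convex loss, f(x) = x - ln(x), whose minimizer is x = 1. The loss, the starting point, and the helper `newton_minimize` are assumptions chosen for illustration; this is not the paper's algorithm or experiment. For this particular loss the Newton update simplifies to x_new = 2x - x^2, so the error 1 - x is squared at every step.

```python
def newton_minimize(grad, hess, x0, steps):
    """Run plain Newton iterations x <- x - grad(x) / hess(x) and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(x - grad(x) / hess(x))
    return xs

# Toy convex loss f(x) = x - ln(x) on x > 0, minimized at x = 1.
iterates = newton_minimize(
    grad=lambda x: 1.0 - 1.0 / x,   # f'(x)
    hess=lambda x: 1.0 / x ** 2,    # f''(x)
    x0=0.5,
    steps=5,
)

# Distance to the minimizer after each iteration.
errors = [abs(1.0 - x) for x in iterates]
```

The number of correct digits roughly doubles per iteration, in contrast to the constant-factor error reduction typical of first-order methods.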

Abstract

Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton's method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent,…
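The idea of replacing an explicit Hessian inverse with a cubic regularized auxiliary objective can be sketched in one dimension. The example below assumes the generic cubic model m(d) = g d + (1/2) h d^2 + (M/6) |d|^3, in the style of Nesterov and Polyak, applied to a toy nonconvex loss to be minimized; the loss, the constant `M`, and the function names are hypothetical and do not come from the paper, whose method operates on RKHS policies rather than scalars.

```python
import math

def cubic_newton_step(g, h, M):
    """Global minimizer of m(d) = g*d + 0.5*h*d**2 + (M/6)*abs(d)**3 for M > 0."""
    # The step length r >= 0 solves (M/2)*r**2 + h*r - abs(g) = 0.
    r = (-h + math.sqrt(h * h + 2.0 * M * abs(g))) / M
    # Move against the gradient sign (either direction is optimal when g == 0).
    return -r if g > 0 else r

def loss(x):
    return 0.25 * x ** 4 - 0.5 * x ** 2   # toy loss with minima at x = -1 and x = 1

def grad(x):
    return x ** 3 - x

def hess(x):
    return 3.0 * x ** 2 - 1.0

M = 12.0   # assumed bound on how fast the second derivative changes
x = 0.1    # the second derivative is negative here, so a plain Newton step would head toward x = 0
losses = [loss(x)]
for _ in range(30):
    x += cubic_newton_step(grad(x), hess(x), M)
    losses.append(loss(x))
x_final = x
```

No division by the second derivative occurs, so the step stays well defined where the curvature is negative or zero, and the cubic term keeps each step bounded.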

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* Paper is well written and follows a clear structure.
* Rigorous theoretical analysis with resulting guarantees.
* Experimental evaluations show significant performance improvements.

Weaknesses

* A minimisation problem over $J(\pi_\theta)$ is introduced in Sec. 2.1. Yet, $J$ is formulated as the expected cumulative reward, which an agent should be seeking to maximise, instead of minimise. The result of the regularised Newton step in Eq. 5 also seems to be leading in a descent, instead of ascent, direction.
* Experimental evaluation is limited to a toy experiment and relatively simple classic RL problems (e.g., CartPole).
* Notation for temperature and trajectories set use the same symb

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper is written clearly and seems to be correct. Results are novel and interesting, and the proposed method may have a strong impact.

Weaknesses

I do not see any major weaknesses, however I can highlight a couple of minor issues:
- Section 4.3 seems unnecessary, it’s just a re-statement of known results about convergence rate of Newton’s method on strongly convex losses. Perhaps this space could be used instead to extend Section 3, which represents the main contribution and it’s not very easy to grasp.
- Line 94: “The objective of RL is to minimize…” It should be “maximize”. Similarly, the following equation should be “argmax”, not “a

Reviewer 03 · Rating 6 · Confidence 2

Strengths

The authors provide extensive theory for their method, proving a quadratic convergence rate. The empirical evaluation results reflect the superior convergence rate.

Weaknesses

1. The authors evaluate their method only on three tasks, which are all very low-dimensional, discrete, and relatively simple. I encourage the authors to add more tasks to the evaluation. Is the method also applicable to continuous control tasks?
2. While the proposed method achieves the highest reward out of the methods compared, it still does not seem to solve the LunarLander task consistently (the Gymnasium documentation specifies a reward threshold of 200 for an episode to be considered sol

Taxonomy

Topics: Metaheuristic Optimization Algorithms Research