Policy Newton Algorithm in Reproducing Kernel Hilbert Space
Yixian Zhang, Huaze Tang, Chao Wang, Wenbo Ding

TL;DR
This paper introduces Policy Newton in RKHS, a novel second-order optimization method for RL policies in Reproducing Kernel Hilbert Spaces, enabling faster convergence and better performance than existing first-order methods.
Contribution
It develops the first second-order optimization framework for RKHS-based RL policies, transforming an infinite-dimensional problem into a finite-dimensional one with theoretical guarantees.
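The finite-dimensional reduction mentioned above can be illustrated with a small sketch. By the Representer Theorem, the minimizer of a regularized objective over an RKHS can be written as a finite kernel expansion f(s) = Σ_i α_i k(s_i, s) over sampled points, so optimization is carried out over the coefficient vector α. The Gaussian kernel, the bandwidth, and every name below (`gaussian_kernel`, `evaluate_expansion`, `rkhs_norm_squared`, `centers`, `alpha`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Gram matrix with entries exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    x = np.atleast_2d(x)
    y = np.atleast_2d(y)
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def evaluate_expansion(alpha, centers, queries, bandwidth=1.0):
    """Evaluate f(s) = sum_i alpha_i * k(s_i, s) at each query point."""
    return gaussian_kernel(queries, centers, bandwidth) @ alpha

def rkhs_norm_squared(alpha, centers, bandwidth=1.0):
    """For a finite expansion, ||f||^2 in the RKHS equals alpha^T K alpha."""
    K = gaussian_kernel(centers, centers, bandwidth)
    return float(alpha @ K @ alpha)

# Hypothetical sampled states (centers) and coefficients.
centers = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
alpha = np.array([0.5, -0.2, 0.1])

values = evaluate_expansion(alpha, centers, centers)
norm_sq = rkhs_norm_squared(alpha, centers)
```

The function itself lives in an infinite-dimensional space, but once the centers are fixed, everything needed for optimization (function values and the RKHS norm) depends only on the three numbers in `alpha` and the Gram matrix.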
Findings
Achieves quadratic convergence to a local optimum.
Demonstrates superior convergence speed over first-order methods.
Attains higher episodic rewards on benchmark tasks.
Abstract
Reinforcement learning (RL) policies represented in Reproducing Kernel Hilbert Spaces (RKHS) offer powerful representational capabilities. While second-order optimization methods like Newton's method demonstrate faster convergence than first-order approaches, current RKHS-based policy optimization remains constrained to first-order techniques. This limitation stems primarily from the intractability of explicitly computing and inverting the infinite-dimensional Hessian operator in RKHS. We introduce Policy Newton in RKHS, the first second-order optimization framework specifically designed for RL policies represented in RKHS. Our approach circumvents direct computation of the inverse Hessian operator by optimizing a cubic regularized auxiliary objective function. Crucially, we leverage the Representer Theorem to transform this infinite-dimensional optimization into an equivalent,…
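To make the cubic regularized auxiliary objective in the abstract more concrete, the following is a minimal finite-dimensional sketch of a generic cubic-regularized Newton step: it minimizes the model m(d) = gᵀd + ½ dᵀHd + (M/6)‖d‖³ without forming an explicit inverse Hessian. It is not the paper's algorithm. The Euclidean norm stands in for the RKHS norm, the bisection solver is a simple choice made here, the degenerate "hard case" is not handled, and all names (`cubic_newton_step`, `g`, `H`, `M`) are hypothetical. For reward maximization, the same step would be applied to the negated objective.

```python
import numpy as np

def cubic_newton_step(grad, hess, M, tol=1e-10, max_iter=200):
    """Approximately minimize g^T d + 0.5 d^T H d + (M/6) ||d||^3 over d.

    Uses the optimality condition (H + (M/2) r I) d = -g with r = ||d||
    and finds r by bisection. Assumes hess is symmetric and M > 0.
    """
    n = grad.shape[0]
    lam_min = np.linalg.eigvalsh(hess)[0]  # smallest eigenvalue
    # r must make H + (M/2) r I positive definite.
    lo = max(0.0, -2.0 * lam_min / M) + 1e-12

    def step_for(r):
        return np.linalg.solve(hess + 0.5 * M * r * np.eye(n), -grad)

    # Grow the upper bracket until ||d(hi)|| <= hi.
    hi = max(lo, 1.0)
    while np.linalg.norm(step_for(hi)) > hi:
        hi *= 2.0

    # ||d(r)|| - r is decreasing in r, so bisection locates its root.
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(step_for(mid)) > mid:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return step_for(hi)

# Toy gradient and Hessian (illustrative values only).
g = np.array([1.0, -2.0])
H = np.array([[2.0, 0.0], [0.0, 1.0]])
M = 1.0

d = cubic_newton_step(g, H, M)
model_value = g @ d + 0.5 * d @ H @ d + (M / 6.0) * np.linalg.norm(d) ** 3
residual = g + H @ d + 0.5 * M * np.linalg.norm(d) * d
```

At the returned step, `residual` (the gradient of the cubic model) should be close to zero and `model_value` should be negative, which indicates a predicted decrease of the objective. The cubic term keeps the step bounded even when the Hessian is indefinite, which is the usual motivation for this form of regularization.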
Peer Reviews
Decision: ICLR 2026 Poster
* Paper is well written and follows a clear structure.
* Rigorous theoretical analysis with resulting guarantees.
* Experimental evaluations show significant performance improvements.
* A minimisation problem over $J(\pi_\theta)$ is introduced in Sec. 2.1. Yet $J$ is formulated as the expected cumulative reward, which an agent should seek to maximise rather than minimise. The regularised Newton step in Eq. 5 also seems to lead in a descent direction rather than an ascent direction.
* Experimental evaluation is limited to a toy experiment and relatively simple classic RL problems (e.g., CartPole).
* Notation for temperature and trajectories set use the same symb
The paper is written clearly, and seems to be correct. Results are novel and interesting, the proposed method may have a strong impact.
I do not see any major weaknesses; however, I can highlight a couple of minor issues:
- Section 4.3 seems unnecessary; it is just a restatement of known results about the convergence rate of Newton’s method on strongly convex losses. Perhaps this space could be used instead to extend Section 3, which represents the main contribution and is not very easy to grasp.
- Line 94: “The objective of RL is to minimize…” It should be “maximize”. Similarly, the following equation should be “argmax”, not “a
The authors provide extensive theory for their method, proving a quadratic convergence rate. The empirical evaluation results reflect the superior convergence rate.
1. The authors evaluate their method only on three tasks, which are all very low-dimensional, discrete, and relatively simple. I encourage the authors to add more tasks to the evaluation. Is the method also applicable to continuous control tasks?
2. While the proposed method achieves the highest reward out of the methods compared, it still does not seem to solve the LunarLander task consistently (the Gymnasium documentation specifies a reward threshold of 200 for an episode to be considered sol
Taxonomy
Topics: Metaheuristic Optimization Algorithms Research
