Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
Shijin Gong, Kai Ye, Jin Zhu, Xinyu Zhang, Hongyi Zhou, Chengchun Shi

TL;DR
This paper introduces a kernelized advantage estimation method for LLM reasoning that leverages nonparametric statistical techniques to improve policy learning under resource constraints.
Contribution
The paper applies kernel smoothing to value function estimation for LLMs, offering a resource-efficient alternative to existing methods while keeping value estimates accurate.
Findings
Kernel smoothing improves value estimation accuracy.
The method reduces variance in policy gradient estimates.
Numerical and theoretical results validate the approach.
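As a rough illustration of what kernel smoothing means for value estimation, the sketch below uses a generic Nadaraya-Watson estimator: each prompt's value is estimated as a kernel-weighted average of the rewards observed on other, nearby prompts, and the advantage is the prompt's own reward minus that estimate. This is not taken from the paper. The prompt feature vectors, the Gaussian kernel, the `bandwidth` parameter, the leave-one-out weighting, and the function names are all assumptions made for illustration.

```python
import numpy as np

def kernel_value_estimates(features, rewards, bandwidth=1.0):
    """Leave-one-out Nadaraya-Watson value estimates (illustrative assumption,
    not the paper's estimator).

    features: (n, d) array of hypothetical prompt feature vectors.
    rewards:  (n,) array with one observed reward per prompt.
    """
    features = np.asarray(features, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Pairwise squared Euclidean distances between prompts, shape (n, n).
    diffs = features[:, None, :] - features[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)

    # Gaussian kernel weights; zero the diagonal so a prompt's own reward
    # does not enter its baseline (leave-one-out).
    weights = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    np.fill_diagonal(weights, 0.0)

    totals = weights.sum(axis=1)
    # Fall back to the global mean reward if a prompt has no effective neighbours.
    safe_totals = np.where(totals > 0, totals, 1.0)
    return np.where(totals > 0, weights @ rewards / safe_totals, rewards.mean())

def kernel_advantages(features, rewards, bandwidth=1.0):
    """Advantage = observed reward minus the kernel-smoothed value estimate."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - kernel_value_estimates(features, rewards, bandwidth)
```

With a small bandwidth the estimate relies on only the closest prompts (lower bias, higher variance); with a large bandwidth it approaches the global mean reward. This is the usual bias-variance trade-off in kernel regression.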
Abstract
Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted. The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it requires sampling a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical,…
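The second family of approaches described in the abstract, approximating the value function with sample averages, can be sketched as follows: several reasoning traces are sampled per prompt, their mean reward serves as the baseline, and each trace's advantage is its reward minus that mean. This is a minimal sketch of the general idea rather than any specific published algorithm. The function name and the array layout are assumptions.

```python
import numpy as np

def group_mean_advantages(rewards):
    """Sample-average baseline (illustrative sketch).

    rewards: array of shape (num_prompts, traces_per_prompt), one reward per
    sampled reasoning trace. The layout is an assumption for this example.
    """
    rewards = np.asarray(rewards, dtype=float)
    # Per-prompt value estimate: the mean reward over that prompt's traces.
    baseline = rewards.mean(axis=1, keepdims=True)
    return rewards - baseline
```

The baseline's accuracy depends on the number of traces per prompt, which is where the computational cost noted in the abstract comes from. With a single trace per prompt the advantage computed this way is always zero, so the single-trajectory setting needs a different source for its baseline.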
