Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Shijin Gong; Kai Ye; Jin Zhu; Xinyu Zhang; Hongyi Zhou; Chengchun Shi

arXiv:2604.28005 · cs.LG · May 19, 2026

TL;DR

This paper introduces a kernelized advantage estimation method for LLM reasoning that leverages nonparametric statistical techniques to improve policy learning under resource constraints.

Contribution

The paper applies kernel smoothing to value function estimation in RL-based LLM training, aiming for accurate value estimates without the cost of training a value network or sampling many reasoning traces per prompt.
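To make the idea of kernel smoothing concrete, here is a minimal sketch of a Nadaraya-Watson estimator, the textbook kernel smoother from nonparametric statistics. It is an illustration only, not the paper's estimator: the function names, the Gaussian kernel, the bandwidth values, and the choice to smooth over toy prompt feature vectors are all assumptions, since this page does not say what the method smooths over.

```python
import math

def gaussian_kernel(u, v, bandwidth):
    """Gaussian kernel weight between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))

def kernel_value_estimate(query, features, rewards, bandwidth=1.0):
    """Nadaraya-Watson estimate: kernel-weighted average of observed rewards.

    `features` are hypothetical per-sample feature vectors (an assumption of
    this sketch); `rewards` are the scalar rewards observed for those samples.
    """
    weights = [gaussian_kernel(query, f, bandwidth) for f in features]
    total = sum(weights)
    if total == 0.0:
        # Every weight underflowed to zero: fall back to the plain mean.
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total

# Toy data: two samples near the query with reward 1, one far away with reward 0.
features = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
rewards = [1.0, 1.0, 0.0]

# Smoothed value estimate at the query point, used as a baseline.
value = kernel_value_estimate([0.05, 0.0], features, rewards, bandwidth=0.5)

# Advantage of a new sample at the query point that earned reward 1.0.
advantage = 1.0 - value
```

With a small bandwidth the estimate is dominated by nearby samples, so the distant zero-reward sample barely contributes. With a very large bandwidth every sample gets nearly equal weight and the estimate approaches the plain sample mean.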

Findings

01

Kernel smoothing improves value estimation accuracy.

02

The method reduces variance in policy gradient estimates.

03

Numerical and theoretical results validate the approach.

Abstract

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical,…
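The second approach described in the abstract, approximating the value function by a sample average, can be sketched in a few lines. This is a generic illustration under assumptions: the function name `group_mean_advantages` and the toy rewards are hypothetical, and the sketch omits any normalization that particular algorithms apply.

```python
def group_mean_advantages(rewards):
    """Advantage of each sampled trace relative to the group's mean reward.

    `rewards` holds one scalar reward per reasoning trace sampled for the
    same prompt; the group mean stands in for the value function.
    """
    if not rewards:
        raise ValueError("need at least one sampled reward")
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Four traces for one prompt, two of them correct (reward 1.0).
advs = group_mean_advantages([1.0, 0.0, 0.0, 1.0])

# With a single trace the baseline equals the reward itself, so the
# advantage is zero and carries no learning signal.
single = group_mean_advantages([1.0])
```

In this sketch the baseline gets more accurate only as more traces are sampled per prompt, which matches the cost the abstract attributes to this family of methods. A group of size one cannot serve as its own baseline, so a single-trajectory method has to obtain its baseline some other way.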
