Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution

Zhengbo Jiao; Hongyu Xian; Qinglong Wang; Yunpu Ma; Zhebo Wang; Zifan Zhang; Dezhang Kong; Meng Han

arXiv:2601.20379·cs.AI·January 29, 2026

TL;DR

This paper introduces Policy of Thoughts (PoT), a framework that refines a large language model's reasoning policy at test time by learning from execution feedback, improving performance on complex reasoning tasks.

Contribution

PoT recasts reasoning as a within-instance online optimization process in which the model's policy evolves in real time. Prior test-time scaling methods keep the policy frozen and use execution feedback only as an external signal for filtering or rewriting trajectories.

Findings

01. PoT achieves 49.71% accuracy on LiveCodeBench with a 4B model.

02. PoT outperforms GPT-4o and DeepSeek-V3 despite its smaller size.

03. Dynamic policy refinement significantly improves reasoning accuracy.

Abstract

Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper's epistemology of "conjectures and refutations," we argue that intelligence requires real-time evolution of the model's policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic,…
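The abstract names Group Relative Policy Optimization (GRPO) as the update rule that turns execution feedback into a learning signal. As a minimal sketch of the group-relative step only, the snippet below normalizes each candidate's reward against the mean and standard deviation of its own group. The function name, the reward values, and the choice of population standard deviation are illustrative assumptions, not details taken from the paper; the LoRA update itself is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each candidate relative to its group: (r - mean) / (std + eps).

    Assumption: population standard deviation is used here; some
    implementations use the sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical execution feedback: fraction of tests passed by each of
# four candidate solutions to the same problem instance.
rewards = [0.0, 0.5, 1.0, 0.5]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f}  advantage={advantage:+.3f}")
```

Candidates that beat the group average receive a positive advantage and are reinforced, while those below it are pushed down. When every candidate earns the same reward, all advantages are zero and the group provides no gradient signal, which is one reason diverse exploration matters for this kind of update.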

Taxonomy

Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications