Policy of Thoughts: Scaling LLM Reasoning via Test-time Policy Evolution
Zhengbo Jiao, Hongyu Xian, Qinglong Wang, Yunpu Ma, Zhebo Wang, Zifan Zhang, Dezhang Kong, Meng Han

TL;DR
This paper introduces Policy of Thoughts (PoT), a framework that refines a large language model's reasoning policy at test time by learning from execution feedback, improving performance on complex reasoning tasks.
Contribution
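PoT recasts reasoning as a within-instance online optimization process: the policy evolves in real time while a single problem is being solved, whereas prior test-time methods keep the policy frozen and use execution feedback only as an external signal for filtering or rewriting trajectories.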
Findings
PoT achieves 49.71% accuracy on LiveCodeBench with a 4B model.
PoT outperforms GPT-4o and DeepSeek-V3 despite using a smaller model.
Dynamic policy refinement improves reasoning accuracy significantly.
Abstract
Large language models (LLMs) struggle with complex, long-horizon reasoning due to instability caused by their frozen policy assumption. Current test-time scaling methods treat execution feedback merely as an external signal for filtering or rewriting trajectories, without internalizing it to improve the underlying reasoning strategy. Inspired by Popper's epistemology of "conjectures and refutations," we argue that intelligence requires real-time evolution of the model's policy through learning from failed attempts. We introduce Policy of Thoughts (PoT), a framework that recasts reasoning as a within-instance online optimization process. PoT first generates diverse candidate solutions via an efficient exploration mechanism, then uses Group Relative Policy Optimization (GRPO) to update a transient LoRA adapter based on execution feedback. This closed-loop design enables dynamic,…
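As a rough illustration of the group-relative step that GRPO relies on, the sketch below scores a group of candidate solutions against each other. It is a minimal sketch and not the authors' implementation: the function name, the example rewards (fraction of tests passed), the use of the population standard deviation, and the eps constant are all assumptions made for illustration. The LoRA update itself is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group.

    Hypothetical helper: candidates that beat the group average get a
    positive advantage, candidates below it get a negative one.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps keeps the division finite when every candidate gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed execution feedback for four candidates sampled for one problem,
# expressed as the fraction of test cases each candidate passed.
rewards = [1.0, 0.0, 0.5, 0.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

In a setup like the one the abstract describes, these advantages would weight the policy-gradient update applied to the transient adapter, so that the next round of candidates is sampled from a policy shifted toward what passed execution.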
Taxonomy
Topics: Topic Modeling · Machine Learning in Materials Science · Multimodal Machine Learning Applications
