RL of Thoughts: Navigating LLM Reasoning with Inference-time Reinforcement Learning
Qianyue Hao, Sibo Li, Jian Yuan, Yong Li

TL;DR
This paper introduces RL-of-Thoughts (RLoT), a reinforcement learning-based method that trains a lightweight navigator to adaptively select reasoning strategies, significantly improving LLM reasoning performance across diverse tasks without modifying the models.
Contribution
The paper proposes a novel RL-based navigator that dynamically adapts reasoning strategies for LLMs, outperforming existing inference-time techniques and demonstrating strong transferability.
Findings
RLoT outperforms existing inference-time methods by up to 13.4%.
A small RL navigator makes sub-10B LLMs comparable to 100B models.
The RL navigator generalizes well across unseen LLMs and tasks.
Abstract
Despite rapid advancements in large language models (LLMs), their token-level autoregressive nature constrains their complex reasoning capabilities. Inference-time techniques such as Chain/Tree/Graph-of-Thought(s) improve LLM reasoning cost-effectively by guiding it through sophisticated logical structures without modifying the LLMs' parameters. However, these manually predefined, task-agnostic frameworks are applied uniformly across diverse tasks and lack adaptability. To address this, we propose RL-of-Thoughts (RLoT), in which we train a lightweight navigator model with reinforcement learning (RL) to adaptively enhance LLM reasoning at inference time. Specifically, we design five basic logic blocks from the perspective of human cognition. During the reasoning process, the trained RL navigator dynamically selects the suitable…
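The selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`navigate`, `toy_get_state`, `toy_navigator`, `toy_run_block`), the step budget, and the integer block indices are all hypothetical, and the real logic blocks and state features are not listed in this summary.

```python
from typing import Callable, List, Tuple

NUM_BLOCKS = 5  # the abstract mentions five logic blocks; their names are not given here


def navigate(
    question: str,
    get_state: Callable[[str], List[float]],
    navigator: Callable[[List[float]], int],
    run_block: Callable[[int, str], str],
    max_steps: int = 5,  # assumed step budget, not taken from the paper
) -> Tuple[str, List[int]]:
    """Repeatedly pick a logic block from the current state and apply it."""
    thoughts = question
    trace: List[int] = []
    for _ in range(max_steps):
        state = get_state(thoughts)            # e.g. an LLM self-evaluation
        action = navigator(state)              # index of the chosen logic block
        trace.append(action)
        thoughts = run_block(action, thoughts)  # frozen LLM executes the block
    return thoughts, trace


# Toy stand-ins so the sketch runs without an LLM (all hypothetical).
def toy_get_state(thoughts: str) -> List[float]:
    return [float(len(thoughts))]


def toy_navigator(state: List[float]) -> int:
    return int(state[0]) % NUM_BLOCKS


def toy_run_block(action: int, thoughts: str) -> str:
    return thoughts + f" [block {action}]"


final_thoughts, block_trace = navigate("Q", toy_get_state, toy_navigator, toy_run_block)
```

In a real setup, `navigator` would be the trained RL policy and `run_block` would prompt the frozen LLM; only the navigator is trained.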
Peer Reviews
Decision: ICLR 2026 Poster
- The idea of viewing reasoning as an MDP with composable cognitive actions is original and conceptually appealing. The method is explained clearly.
- The proposed navigator has fewer than 3K parameters, making it lightweight, computationally inexpensive, and compatible with frozen LLMs.
- The authors experimented across reasoning, math, STEM, and commonsense domains, and also covered both small- and medium-sized models.
- The authors show that their method can perform cross-model and cross-t
- There is no ablation on reward signals or PRM accuracy, so it is unclear whether improvements come from RL navigation or simply from PRM-guided prompting.
- Some parts of the paper are not detailed enough. For example, in Section 4.5, how many instances were considered? The authors do not justify why Double-Dueling DQN is used. There is no exploration of policy stability, convergence, or sample efficiency.
- The paper largely lacks qualitative analysis. The paper presents only anecdotal examples o
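For readers unfamiliar with the algorithm the reviewer mentions, the two standard ingredients of Double-Dueling DQN can be sketched as below. This shows the textbook formulation only; the function names are illustrative, and nothing here is taken from the paper's code or hyperparameters.

```python
from typing import List


def dueling_q(value: float, advantages: List[float]) -> List[float]:
    """Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


def double_dqn_target(
    reward: float,
    gamma: float,
    done: bool,
    q_online_next: List[float],
    q_target_next: List[float],
) -> float:
    """Double DQN target: the online network picks the next action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


q_values = dueling_q(1.0, [1.0, 2.0, 3.0])
target = double_dqn_target(1.0, 0.5, False, [0.1, 0.9], [4.0, 2.0])
```

Decoupling action selection from evaluation is the usual argument for Double DQN (it reduces overestimation of Q-values), and the dueling head separates state value from per-action advantage. Whether these properties matter for a navigator this small is the open question the reviewer raises.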
- Reasoning as an MDP with a standardized action set and state interface; prompts for states/actions are fully specified and the MDP is clearly defined. The navigator adds negligible runtime overhead; training cost is amortized and reported with token accounting.
- Good comparison with baselines (fixed workflows like DeAR, search methods like GoT, r*, LiteSearch, DSPy, ...) with an "extended table" across GSM8K/GPQA/StrategyQA; the authors also report token usage vs. accuracy.
- Proper ablation
A major problem is the insignificance of many results. For instance, in Table 9, RLoT = 92.87 while several baselines cluster at ≥92% (e.g., Buffer-of-Thoughts 92.35, ...). The paper text claims overall wins, but GSM8K specifically looks like parity rather than a clear lead; without error bars, the <1% differences are hard to interpret, and they are rather frequent in the paper.
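The reviewer's concern can be made concrete with a rough binomial standard-error estimate. The sketch below assumes the accuracies were measured on the full GSM8K test split of 1,319 questions and treats the two systems as independent samples; both are assumptions, since this summary does not state the evaluation setup, and a paired test on the same questions would give a different interval.

```python
import math

N_GSM8K_TEST = 1319  # size of the GSM8K test split; assumed to be the evaluation set


def accuracy_standard_error(acc_percent: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


se_rlot = accuracy_standard_error(92.87, N_GSM8K_TEST)      # RLoT, Table 9
se_baseline = accuracy_standard_error(92.35, N_GSM8K_TEST)  # Buffer-of-Thoughts
se_diff = math.sqrt(se_rlot**2 + se_baseline**2)            # independence assumed
gap = 92.87 - 92.35

print(f"SE(RLoT)={se_rlot:.2f}, SE(diff)={se_diff:.2f}, gap={gap:.2f}")
```

Under these assumptions each accuracy carries a standard error of roughly 0.7 percentage points, so a 0.52-point gap is smaller than the uncertainty of the difference, which is consistent with the reviewer's reading of parity on GSM8K.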
1. The primary strength is the RLoT framework itself. It reframes the problem from "how to build a better reasoning LLM" or "what is the best fixed reasoning structure" to "how to learn a policy that navigates a frozen LLM's reasoning space." This is a clever and effective conceptual leap.
2. The result that a sub-10B model with a 3K-parameter navigator can close most of the performance gap to a 70B model on tasks like GPQA and StrategyQA is a massive win.
3. The method is tested on mult
1. The state space, which is the sole input to the RL agent, is based on the LLM's own self-evaluation. The authors provide a validation (82% accuracy) in Appendix F and show it works better than alternatives. However, this remains a potential point of noise or failure. One might question whether an LLM that is failing at a reasoning task (the "hard questions" used for training) can simultaneously be a reliable evaluator of its own failing state. This could be particularly problematic for smaller, l
Taxonomy
Topics: Semantic Web and Ontologies · Multi-Agent Systems and Negotiation · Natural Language Processing Techniques
