TemplateRL: Structured Template-Guided Reinforcement Learning for LLM Reasoning
Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Haoran Luo, Ling Yang, Huazhe Xu, Jianhua Tao

TL;DR
TemplateRL introduces a structured, template-guided reinforcement learning framework that improves reasoning efficiency and transferability by integrating explicit problem-solving templates into policy training.
Contribution
The paper presents a structured RL approach that uses a template library built via Monte Carlo Tree Search (MCTS), improving training stability, efficiency, and interpretability over existing unstructured methods.
Findings
TemplateRL outperforms GRPO by 99% on AIME.
TemplateRL outperforms GRPO by 41% on AMC.
TemplateRL demonstrates superior stability and cross-domain generalization.
Abstract
Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods like GRPO typically rely on unstructured self-sampling to fit scalar rewards, often producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address this limitation, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with proven template structures, TemplateRL significantly improves high-quality trajectory hit rates while reducing ineffective exploration. This structure-guided design steers the policy toward validated strategic…
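To make the two ingredients named in the abstract concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation. The template library contents, the prompt layout, and the function names `build_guided_prompt` and `group_relative_advantages` are hypothetical and chosen only for illustration. The library is hard-coded here, whereas the paper constructs it via MCTS on a seed set. The advantage computation follows the commonly described GRPO recipe of normalizing each rollout's reward against the mean and standard deviation of its sampling group; using the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

# Hypothetical template library. In the paper this is built via MCTS on a
# small seed set; here it is hard-coded purely for illustration.
TEMPLATE_LIBRARY = {
    "algebra": [
        "Restate the problem",
        "Set up equations",
        "Solve the equations",
        "Verify the answer",
    ],
    "counting": [
        "Identify the objects to count",
        "Split into cases",
        "Sum the cases",
        "Verify the answer",
    ],
}


def build_guided_prompt(question, template_key):
    """Prepend a numbered template outline to the question (assumed layout)."""
    steps = TEMPLATE_LIBRARY[template_key]
    outline = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Question: {question}\nFollow this outline:\n{outline}\nSolution:"


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(build_guided_prompt("Solve x + 2 = 5.", "algebra"))
    # Four rollouts for one question with binary correctness rewards.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

In this sketch, a group whose rollouts all receive the same reward yields zero advantage for every rollout and so contributes no learning signal. That is one way to read the abstract's point that raising the hit rate of high-quality trajectories reduces ineffective exploration.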
Taxonomy
Topics: Complex Systems and Decision Making
Methods: Balanced Selection
