Tailored Primitive Initialization is the Secret Key to Reinforcement Learning
Yihang Yao, Guangtao Zeng, Raina Wu, Yang Zhang, Ding Zhao, Zhang-Wei Hong, Chuang Gan

TL;DR
This paper introduces Tailor, a finetuning pipeline that discovers and curates reasoning primitives to improve the initialization of language models, leading to more efficient and stable reinforcement learning for reasoning tasks.
Contribution
The paper proposes a novel method for initializing language models with diverse reasoning primitives, enhancing RL training efficiency and stability.
Findings
Tailor improves reasoning token coverage in LLMs.
Enhanced initialization leads to higher RL performance on reasoning benchmarks.
Diverse reasoning primitives result in more sample-efficient RL training.
Abstract
Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). While RL has demonstrated substantial performance gains, it still faces key challenges, including low sampling efficiency and a strong dependence on model initialization: some models achieve rapid improvements with minimal RL steps, while others require significant training data to make progress. In this work, we investigate these challenges through the lens of reasoning token coverage and argue that initializing LLMs with diverse, high-quality reasoning primitives is essential for achieving stable and sample-efficient RL training. We propose Tailor, a finetuning pipeline that automatically discovers and curates novel reasoning primitives, thereby expanding the coverage of reasoning-state distributions before RL. Extensive experiments on mathematical…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper focuses on the problem of data construction during the warm-start stage of reinforcement learning and proposes the Tailor pipeline, which centers on reasoning primitives to automatically select high-quality and diverse training data. The idea is novel and insightful. 2. The experimental design is relatively complete. Experiments on the iGSM and K&K benchmarks with small Llama and Qwen models provide sufficient evidence for the method’s effectiveness and reproducibility. 3. The paper is
1. Terminology and formalization are insufficient. The notion of thinking token coverage is ambiguous (it can be read as “diversity of reasoning patterns” or as a “token proportion”). If there is an accepted definition, please cite it explicitly in Related Work; if not, provide a computable formal definition and measurement protocol, and add derivations/claims in the preliminaries that directly support the core thesis. 2. Motivation and positioning are not strong enough. The Introduction and Relat
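The ambiguity raised in point 1 above can be made concrete with a toy sketch. This is not the paper's definition of coverage. The whitespace tokenizer, the marker words, and both function names are assumptions chosen for illustration only.

```python
def tokenize(trace):
    """Toy tokenizer (an assumption): lowercase and split on whitespace."""
    return trace.lower().split()


def pattern_diversity(traces, markers):
    """Reading 1: share of distinct marker words seen in at least one trace."""
    marker_set = set(markers)
    if not marker_set:
        raise ValueError("markers must not be empty")
    vocab = set()
    for trace in traces:
        vocab.update(tokenize(trace))
    return len(marker_set & vocab) / len(marker_set)


def token_proportion(traces, markers):
    """Reading 2: share of all tokens that are marker words."""
    marker_set = set(markers)
    tokens = [tok for trace in traces for tok in tokenize(trace)]
    if not tokens:
        return 0.0
    return sum(tok in marker_set for tok in tokens) / len(tokens)


# Hypothetical marker words; the source does not list any.
markers = ["wait", "verify"]
repetitive = ["wait wait wait the answer is 4"]
varied = ["wait then verify the answer"]

print(pattern_diversity(repetitive, markers), token_proportion(repetitive, markers))
print(pattern_diversity(varied, markers), token_proportion(varied, markers))
```

In this toy case the repetitive trace scores higher on token proportion but lower on pattern diversity than the varied one, so the two readings can rank the same models differently. That is why the reviewer asks for a single computable definition.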
- The pipeline is simple and effective.
- The main idea of the paper is to use a teacher model to produce enhanced prompts that are concatenated with the original questions, and these augmented inputs are then used to construct SFT data for training the student model. However, this approach looks more like prompt engineering combined with data distillation than a fundamentally new learning paradigm. The so-called primitives are essentially reasoning strategies written as prompts, which the model is forced to imitate during SFT.
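The data flow this reviewer describes can be sketched roughly as follows. This is a minimal illustration of the reviewer's summary, not the authors' implementation. The names `SFTExample`, `build_augmented_prompt`, and `build_sft_data` are hypothetical, `generate` is a stand-in for whichever model writes the demonstration, and the concatenation order and separator are assumptions.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    question: str   # original question
    primitive: str  # reasoning strategy written as a prompt
    response: str   # demonstration produced from the augmented input


def build_augmented_prompt(question, primitive):
    """Concatenate a primitive with the question (order is an assumption)."""
    return f"{primitive.strip()}\n\n{question.strip()}"


def build_sft_data(questions, primitives, generate):
    """Pair every question with every primitive and collect demonstrations.

    `generate` is a hypothetical callable mapping a prompt string to a
    response string; no filtering or curation step is modeled here.
    """
    examples = []
    for question in questions:
        for primitive in primitives:
            prompt = build_augmented_prompt(question, primitive)
            examples.append(SFTExample(question, primitive, generate(prompt)))
    return examples


# Usage with a stub generator in place of a real model call.
demo = build_sft_data(
    ["What is 2 + 2?"],
    ["Check each step before answering.", "Work backwards from the goal."],
    generate=lambda prompt: f"[demo for: {prompt[:20]}]",
)
print(len(demo))
```

The sketch keeps the question and the primitive in separate fields because the review excerpt does not say whether the student is trained on the augmented input or on the original question alone.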
1. Grounded in a thinking-token (reasoning-primitive) coverage view, the paper optimizes the SFT stage with Tailor, an automated pipeline that analyzes failures and synthesizes repair-oriented primitives to curate diverse demonstrations—explicitly broadening the reasoning-state distribution before RL and thereby improving exploration during training. 2. The experiments are comprehensive: experiments on KK and iGSM across Llama3.2-1B/3B and Qwen2.5-0.5B/3B use DAPO with a rule-based reward and c
1. The main results center on KK and iGSM, which are relatively simple suites, while validation on widely recognized harder benchmarks is missing, for example, MATH or AIME-24/25. 2. This paper primarily uses Llama/Qwen base models and selects DeepSeek-V3 as the teacher model; however, current RL practice often targets reasoning models such as DeepSeek-R1 and its distilled variants. Since these models already exhibit stronger self-reflection in chain-of-thought, it remains unclear whether the p
- The method is more tailored to the student model, because the teacher model analyzes actual reasoning traces from the student model, unlike the traditional warm start in which the teacher model just produces a static SFT dataset from some prompts.
- The topic is relevant, and the writing is mostly clear.
- More costly than a static dataset, because the teacher model must analyze the student's traces. The method also relies on the teacher model to analyze the reasoning traces correctly and to produce valid, improving reasoning primitives. Overall, it depends heavily on a potentially expensive teacher model.
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Reinforcement Learning in Robotics
