Warm Up Before You Train: Unlocking General Reasoning in Resource-Constrained Settings
Safal Shrestha, Minwu Kim, Aadim Nepal, Anubhav Shrestha, Keith Ross

TL;DR
This paper introduces a two-stage, sample-efficient training method for reasoning large language models: a warmup phase that distills long chains of thought from logic puzzles, followed by RLVR on a limited set of target-domain examples, to improve reasoning and generalization in data-scarce settings.
Contribution
The paper proposes a novel warmup strategy using logic puzzles for general reasoning, improving performance and sample efficiency in limited-data scenarios.
Findings
Warmup alone improves reasoning across multiple tasks.
Warmed-up models outperform base models with limited RLVR data.
Warmup maintains cross-domain generalization after domain-specific training.
Abstract
Designing effective reasoning-capable LLMs typically requires training using Reinforcement Learning with Verifiable Rewards (RLVR) or distillation with carefully curated Long Chain of Thoughts (CoT), both of which depend heavily on extensive training data. This creates a major challenge when the amount of quality training data is scarce. We propose a sample-efficient, two-stage training strategy to develop reasoning LLMs under limited supervision. In the first stage, we "warm up" the model by distilling Long CoTs from a toy domain, namely Knights & Knaves (K&K) logic puzzles, to acquire general reasoning skills. In the second stage, we apply RLVR to the warmed-up model using a limited set of target-domain examples. Our experiments demonstrate that this two-phase approach offers several benefits: the warmup phase alone facilitates generalized reasoning, leading to performance…
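To make the toy domain concrete, the sketch below shows one way a Knights & Knaves puzzle can be encoded and checked mechanically: knights always tell the truth, knaves always lie, and a solution is a role assignment consistent with every statement. The puzzle, the function names (`solve_knights_and_knaves`, `verifiable_reward`), and the binary reward are illustrative assumptions, not the authors' code or data. The reward function only illustrates the general idea of a verifiable reward; in the paper, K&K is used for warmup distillation, while RLVR is applied to target-domain examples.

```python
from itertools import product


def solve_knights_and_knaves(names, statements):
    """Return every role assignment consistent with all statements.

    names: list of inhabitant names.
    statements: dict mapping a name to a function that takes a role
        assignment (dict name -> True for knight, False for knave)
        and returns whether that inhabitant's claim holds.
    """
    solutions = []
    for roles in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, roles))
        # A knight's claim must be true; a knave's claim must be false.
        if all(statements[n](assignment) == assignment[n] for n in names):
            solutions.append(assignment)
    return solutions


def verifiable_reward(predicted, names, statements):
    """Binary reward: 1.0 only if the prediction is the unique solution."""
    solutions = solve_knights_and_knaves(names, statements)
    return 1.0 if len(solutions) == 1 and predicted == solutions[0] else 0.0


# Hypothetical puzzle (an assumption for illustration):
# A says "B is a knave"; B says "A and B are both knights".
names = ["A", "B"]
statements = {
    "A": lambda r: not r["B"],
    "B": lambda r: r["A"] and r["B"],
}

print(solve_knights_and_knaves(names, statements))
print(verifiable_reward({"A": True, "B": False}, names, statements))
```

Brute force over all 2^n assignments is adequate for small puzzles like this one; the point is only that correctness can be checked automatically, without human grading.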
Taxonomy
Methods: Balanced Selection · Sparse Evolutionary Training
