Weak-to-Strong Elicitation via Mismatched Wrong Drafts
Wei Deng

TL;DR
The paper shows that injecting mathematically wrong drafts from a smaller model, deliberately mismatched to the current problem, into a stronger learner's GRPO training context improves performance on mathematical problem-solving benchmarks, outperforming standard on-policy GRPO.
Contribution
It introduces an off-policy training approach that places mismatched wrong drafts from a smaller, more domain-trained model (Qwen2.5-Math-1.5B) into the GRPO context of a stronger learner (Mathstral-7B) to improve the learner's math problem-solving.
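The mismatching step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes one wrong draft per problem and uses Sattolo's algorithm so that no draft stays paired with its own problem. The names `mismatch_drafts` and `build_prompt`, and the prompt template, are hypothetical.

```python
import random


def mismatch_drafts(drafts, seed=0):
    """Return a reordering of `drafts` in which no draft keeps its original index."""
    n = len(drafts)
    if n < 2:
        raise ValueError("need at least two drafts to mismatch")
    rng = random.Random(seed)
    idx = list(range(n))
    # Sattolo's algorithm: yields a single n-cycle, so idx[i] != i for every i.
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never j == i
        idx[i], idx[j] = idx[j], idx[i]
    return [drafts[k] for k in idx]


def build_prompt(problem, draft):
    # Hypothetical template; the paper's actual context format is not given here.
    return f"Problem:\n{problem}\n\nDraft attempt (may be wrong):\n{draft}\n\nSolution:"


problems = ["p0", "p1", "p2", "p3"]
drafts = ["draft for p0", "draft for p1", "draft for p2", "draft for p3"]

shuffled = mismatch_drafts(drafts, seed=0)
prompts = [build_prompt(p, d) for p, d in zip(problems, shuffled)]
```

A plain `random.shuffle` would occasionally leave a draft matched to its own problem; the cyclic permutation rules that out, which keeps the matched and mismatched conditions cleanly separated.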
Findings
Mismatched wrong drafts improve MATH-500 greedy pass@1 by 1.62 pp over the matched-wrong variant.
Mismatched wrong drafts outperform other variants on out-of-distribution AIME 2025/2026.
The mismatched-wrong variant reaches 71.98% on MATH-500, ahead of the standard on-policy GRPO baseline.
Abstract
We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model, mismatched to the current problem, into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3-5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields +1.62 pp on MATH-500 (greedy pass@1) over the matched-wrong variant (multiple seeds, Welch's t-test). In fact, the mismatched-wrong variant leads…
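For readers unfamiliar with the training objective, the sketch below shows the group-relative advantage that GRPO-style methods compute per sampled response. It rests on an assumption not stated in this abstract: that Dr. GRPO, as commonly described, centers rewards on the group mean but omits the division by the group standard deviation used in standard GRPO. The function name `group_advantages` and the reward values are illustrative only.

```python
def group_advantages(rewards, normalize_std=False, eps=1e-6):
    """Group-relative advantages for one prompt's sampled responses.

    normalize_std=False: mean-centering only (Dr. GRPO style, as assumed here).
    normalize_std=True:  also divide by the group std (standard GRPO style).
    """
    mean = sum(rewards) / len(rewards)
    centered = [r - mean for r in rewards]
    if not normalize_std:
        return centered
    # Population standard deviation of the group's rewards.
    std = (sum(c * c for c in centered) / len(rewards)) ** 0.5
    return [c / (std + eps) for c in centered]


# Illustrative 0/1 correctness rewards for four sampled answers to one problem.
group_rewards = [1.0, 0.0, 0.0, 1.0]

adv_dr = group_advantages(group_rewards)
adv_std = group_advantages(group_rewards, normalize_std=True)
```

With mean-centering only, the size of the advantage depends on how split the group is, whereas dividing by the standard deviation rescales every group to a similar magnitude regardless of problem difficulty.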
