Weak-to-Strong Elicitation via Mismatched Wrong Drafts

Wei Deng

arXiv:2605.17314·cs.CL·May 19, 2026

TL;DR

The paper shows that injecting mathematically wrong drafts from a smaller model, shuffled so that they do not match the current problem, into a stronger learner's GRPO context improves math problem-solving performance over standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026.

Contribution

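It introduces an off-policy training approach that places mismatched wrong drafts from a smaller but more domain-trained model (Qwen2.5-Math-1.5B) into the GRPO context of a stronger learner (Mathstral-7B) to elicit math problem-solving capability that on-policy GRPO does not reach.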

Findings

01

Shuffling wrong drafts to mismatched problems improves MATH-500 greedy pass@1 by +1.62 pp over the matched-wrong variant (n = 10 seeds, p = 0.0015, Welch's t-test).

02

Mismatched wrong drafts outperform other variants on out-of-distribution AIME 2025/2026.

03

The method achieves 71.98% on MATH-500, surpassing previous state-of-the-art models.
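
The matched-versus-mismatched comparison in finding 01 is reported with Welch's t-test over per-seed scores. As a rough illustration of what that test computes, here is a minimal sketch using only the Python standard library. The function name `welch_t` and the sample values are hypothetical placeholders and are not taken from the paper.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Placeholder per-seed scores, NOT results from the paper.
mismatched = [3.4, 3.1, 3.6, 3.3, 3.5]
matched = [3.0, 2.8, 3.2, 2.9, 3.1]
t_stat, dof = welch_t(mismatched, matched)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The p-value then comes from the Student t distribution with `df` degrees of freedom. In practice, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the p-value directly.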

Abstract

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$ pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n = 10$ seeds, $p = 0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads…
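
The mismatch step described in the abstract, shuffling wrong drafts so that each lands on a different problem, can be sketched as follows. This is a minimal illustration and not the paper's implementation: the function names `mismatch_drafts` and `build_prompt`, the prompt template, and the toy data are all assumptions, and the sketch assumes `drafts[i]` is a wrong draft originally written for `problems[i]`.

```python
import random


def mismatch_drafts(problems, drafts, seed=0):
    """Pair every problem with a draft that was written for a different problem."""
    n = len(problems)
    if n != len(drafts):
        raise ValueError("problems and drafts must have the same length")
    if n < 2:
        raise ValueError("need at least two problems to mismatch drafts")
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    # Each index takes the draft of its successor in the shuffled order.
    # This forms a single cycle of length n, so no problem keeps its own draft.
    source = [0] * n
    for k, idx in enumerate(order):
        source[idx] = order[(k + 1) % n]
    return [(problems[i], drafts[source[i]]) for i in range(n)]


def build_prompt(problem, draft):
    # Hypothetical template; the paper's actual context format is not given here.
    return f"Problem:\n{problem}\n\nDraft:\n{draft}\n\nSolution:\n"


# Toy data for illustration only.
problems = ["Solve x + 2 = 5.", "Compute 7 * 8.", "Factor x^2 - 1."]
drafts = ["x = 4", "7 * 8 = 54", "(x - 1)^2"]
for problem, draft in mismatch_drafts(problems, drafts, seed=1):
    print(build_prompt(problem, draft))
```

A cyclic rotation over a shuffled order is used instead of a plain `random.shuffle` because a plain shuffle can leave some drafts paired with their original problems, which would blur the matched versus mismatched contrast.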
