Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling
Yizhu Jiao, Ruixiang Zhang, Richard Bai, Jiawei Han, Ronan Collobert, Yizhe Zhang

TL;DR
This paper introduces DuST, a dual self-training framework that uses the relative correctness judgments revealed by test-time sampling to improve both the judgment and the generation abilities of code generation models.
Contribution
The paper proposes a dual judgment space and a discriminative training method that improves model performance without directly rewarding correct programs.
Findings
DuST improves judgment quality by +6.2 NDCG on LiveCodeBench.
DuST improves single-sample pass@1 by +3.1.
Best-of-4 accuracy increases by +4.1 on Qwen3-30B-Thinking.
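For readers unfamiliar with the judgment metric, the following is a minimal sketch of NDCG applied to a judge's ranking of candidate programs. It assumes binary relevance (1 if a candidate passes execution, 0 otherwise) and the standard log2 discount; the exact NDCG variant used in the paper is not stated here, and the function names `dcg` and `ndcg` are illustrative.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(scores: Sequence[float], labels: Sequence[int]) -> float:
    """NDCG of the ranking induced by judge scores (higher score = ranked first).

    labels[i] is 1 if candidate i passed execution and 0 otherwise
    (an assumed binary relevance scheme).
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank candidates by descending judge score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranked = [labels[i] for i in order]
    # The ideal ranking places every passing candidate first.
    ideal_dcg = dcg(sorted(labels, reverse=True))
    if ideal_dcg == 0:
        return 0.0  # no passing candidate: NDCG is defined as 0 in this sketch
    return dcg(ranked) / ideal_dcg


if __name__ == "__main__":
    # Four candidates; the judge ranks one passing candidate first and one last.
    print(round(ndcg([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1]), 3))
```

A judge that places all passing candidates above all failing ones scores 1.0; lower values mean that failing programs were ranked above passing ones.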
Abstract
Code generation is typically trained in the primal space of programs: a model produces a candidate solution and receives sparse execution feedback, often a single pass/fail bit. Test-time scaling enriches the inference procedure by sampling multiple candidates and judging among them, but the comparative information this process reveals is discarded after inference. We argue that this information defines a dual judgment space that provides a far richer training signal: the model learns not from an isolated success or failure, but from the relative correctness structure across its own plausible attempts, identifying which succeed, which fail, and what distinguishes them. We introduce DuST (Dual Self-Training), a framework for self-training from the dual judgment space. DuST samples candidate programs from the model's own distribution, labels them through sandbox execution, retains groups…
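The sampling and labeling steps named in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `sample_fn` is a hypothetical stand-in for drawing one program from the model, candidates are assumed to define a function called `solve`, and an in-process `exec` stands in for a real sandbox (it provides no isolation and no timeout). The sketch stops at labeling and does not attempt the group-retention step.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# A test case is (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


@dataclass
class LabeledCandidate:
    source: str
    passed: bool


def passes_tests(source: str, tests: Sequence[TestCase],
                 entry_point: str = "solve") -> bool:
    """Run a candidate against the tests; any error counts as a failure.

    Assumption: exec in a fresh namespace replaces a real sandbox here.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        fn = namespace[entry_point]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False


def label_group(sample_fn: Callable[[], str], tests: Sequence[TestCase],
                n: int) -> List[LabeledCandidate]:
    """Draw n candidates for one problem and attach pass/fail labels."""
    group = []
    for _ in range(n):
        source = sample_fn()
        group.append(LabeledCandidate(source, passes_tests(source, tests)))
    return group


if __name__ == "__main__":
    # Fixed strings stand in for model samples.
    samples = iter([
        "def solve(a, b):\n    return a + b\n",
        "def solve(a, b):\n    return a - b\n",
    ])
    group = label_group(lambda: next(samples), [((1, 2), 3)], n=2)
    print([c.passed for c in group])
```

Each labeled group records which of the model's own attempts succeed and which fail on the same problem, which is the comparative information the abstract describes as the dual judgment space.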
