Peer-Predictive Self-Training for Language Model Reasoning
Shi Feng, Hanlin Zhang, Fan Nie, Sham Kakade, and Yiling Chen

TL;DR
This paper introduces Peer-Predictive Self-Training (PST), a collaborative, label-free fine-tuning method where multiple language models improve through internal feedback without external supervision.
Contribution
PST uses a cross-model aggregated response as an internal training target and pointwise mutual information (PMI) to scale self-training updates, improving reasoning accuracy and narrowing the generator-verifier gap without external labels.
Findings
PST improves exact-match accuracy by 2.2 to 4.3 percentage points.
PST reduces the generator-verifier gap by 26% to 40%.
PST requires no external supervision, relying solely on cross-model interactions.
Abstract
Mechanisms for continued self-improvement of language models without external supervision remain an open challenge. We propose Peer-Predictive Self-Training (PST), a label-free fine-tuning framework in which multiple language models improve collaboratively by leveraging a cross-model aggregated response as an internal training signal. Given a prompt question, the models generate responses sequentially; the final aggregated answer, often more reliable than individual responses in practice, serves as an internal target for learning. We measure how informative each intermediate response is about the aggregate using pointwise mutual information (PMI), and use this signal to scale self-training updates. Responses already aligned with the aggregate are updated less, while less informative or misaligned responses are updated more. On mathematical reasoning benchmarks (SimulEq, Math500, and…
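The PMI-based scaling described in the abstract can be illustrated with a small sketch. It assumes PMI is computed as the log-probability of the aggregate given the question and a response, minus the log-probability of the aggregate given the question alone. The function names, the sigmoid weighting, the `temperature` parameter, and the toy probabilities are all hypothetical choices made for illustration; the abstract does not specify the paper's exact weighting rule.

```python
import math


def pmi(logp_agg_given_response: float, logp_agg: float) -> float:
    """Pointwise mutual information between a response and the aggregate.

    Computed as log p(aggregate | question, response) - log p(aggregate | question).
    Both log-probabilities are assumed to come from some scoring model.
    """
    return logp_agg_given_response - logp_agg


def update_weight(pmi_value: float, temperature: float = 1.0) -> float:
    """Hypothetical update weight in (0, 1) that decreases as PMI increases.

    Implemented as sigmoid(-pmi / temperature). This is an assumed form,
    chosen only so that aligned responses (high PMI) get smaller updates.
    """
    return 1.0 / (1.0 + math.exp(pmi_value / temperature))


# Toy probabilities, made up for illustration (not results from the paper).
# Aligned response: seeing it raises the aggregate's probability from 0.2 to 0.8.
aligned_pmi = pmi(math.log(0.8), math.log(0.2))
# Misaligned response: seeing it lowers the aggregate's probability from 0.2 to 0.05.
misaligned_pmi = pmi(math.log(0.05), math.log(0.2))

w_aligned = update_weight(aligned_pmi)
w_misaligned = update_weight(misaligned_pmi)

print(f"aligned:    PMI={aligned_pmi:+.3f}, weight={w_aligned:.2f}")
print(f"misaligned: PMI={misaligned_pmi:+.3f}, weight={w_misaligned:.2f}")
```

Under these assumptions, the misaligned response receives the larger weight, matching the stated behavior that responses already aligned with the aggregate are updated less. In a training loop, such a weight could multiply the per-response fine-tuning loss, though how PST applies it in practice is not stated in the abstract.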
