Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study
Zachary Burton

TL;DR
This study investigates inference-time diversity in RL-trained Lean theorem provers, revealing that strategic prompt variations significantly improve theorem-solving performance and are specific to RL training stages.
Contribution
It introduces a diagnostic approach to understanding inference-time diversity effects, demonstrating that prompt diversity can mitigate mode-collapse in RL-trained provers.
Findings
Doubling sampling budget yields no additional solved theorems.
A fixed schedule of tactic skeletons improves theorem-solving by 45%.
Diversity interventions have RL-specific effects, not seen in SFT-trained models.
Abstract
RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from to produces zero additional solved theorems (42/244 in both cases). A fixed schedule of 15 tactic skeletons breaks this plateau and recovers a relative improvement at (mean theorems across seeds, sign preserved in every seed). A controlled diversity ablation rules out the prompt-diversity confound: tactic skeletons help, paraphrases match the baseline, and irrelevant Lean comments actively degrade. A leave-one-out formalization-difficulty stratification reveals a structural-content gradient across the three perturbations. The phenomenon is RL-specific: V1.5-Base proves zero theorems regardless of intervention, identifying RL as the stage that creates the proof capability…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Machine Learning and Data Classification
