RTLC -- Research, Teach-to-Learn, Critique: A three-stage prompting paradigm inspired by the Feynman Learning Technique that lifts LLM-as-judge accuracy on JudgeBench with no fine-tuning
Andrea Morandi

TL;DR
RTLC is a three-stage prompting recipe (Research, Teach-to-Learn, Critique), inspired by the Feynman Learning Technique, that raises Claude 3.7 Sonnet's LLM-as-judge accuracy on JudgeBench from 64.6% to 78.6% without fine-tuning.
Contribution
The paper presents a three-stage prompting recipe that improves LLM judgment accuracy by combining a fixed pedagogical scaffold, sampling of multiple candidate verdicts, and a final self-critique pass, with no additional training, retrieval, or external tools.
Findings
RTLC improves Claude 3.7 Sonnet's accuracy from 64.6% to 78.6% on JudgeBench.
RTLC outperforms self-consistency voting and single-shot prompts.
An ablation shows that the Teach-to-Learn scaffold contributes 9.4 percentage points of accuracy.
Abstract
LLM-as-a-judge is now the default measurement instrument for open-ended generation, but on the public JudgeBench benchmark even strong instruction-tuned judges barely scrape past random on objective-correctness pairwise items. We introduce RTLC, a three-stage prompting recipe -- Research, Teach-to-Learn, Critique -- that promotes a single black-box LLM into an ensemble-of-thought judge with no fine-tuning, retrieval, or external tools. Stage 1 wraps the input in a fixed pedagogical scaffold that ports the Feynman Learning Technique (study, teach, find gaps, simplify) into LLM prompting. Stage 2 draws N=10 independent candidate verdicts at temperature 0.4. In Stage 3 the model acts as its own critic, cross-comparing the candidate set against the original question to emit one critiqued verdict at temperature 0. On JudgeBench-GPT (350 hard pairwise items), Claude 3.7 Sonnet's pairwise…
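The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the `LLM` callable interface, the scaffold wording, the "Verdict: A/B" output convention, and every function name are hypothetical. Only the sample count (N=10) and the temperatures (0.4 for sampling, 0 for the critique) come from the abstract.

```python
import re
from typing import Callable, List

# Hypothetical model interface (an assumption): (prompt, temperature) -> completion text.
LLM = Callable[[str, float], str]

# Illustrative scaffold wording only; the paper's actual prompt is not reproduced here.
SCAFFOLD = (
    "Study the question and both responses.\n"
    "Teach the correct solution as if explaining it to a novice.\n"
    "Find the gaps in each response.\n"
    "Simplify your reasoning, then end with 'Verdict: A' or 'Verdict: B'.\n\n"
    "Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n"
)


def build_scaffold_prompt(question: str, a: str, b: str) -> str:
    """Stage 1: wrap the pairwise item in the fixed pedagogical scaffold."""
    return SCAFFOLD.format(question=question, a=a, b=b)


def parse_verdict(text: str) -> str:
    """Return the last 'Verdict: A|B' label in text, or '' if none is found."""
    matches = re.findall(r"Verdict:\s*([AB])\b", text)
    return matches[-1] if matches else ""


def sample_candidates(llm: LLM, prompt: str, n: int = 10,
                      temperature: float = 0.4) -> List[str]:
    """Stage 2: draw n independent candidate verdicts at a nonzero temperature."""
    return [llm(prompt, temperature) for _ in range(n)]


def critique(llm: LLM, question: str, a: str, b: str,
             candidates: List[str]) -> str:
    """Stage 3: the same model compares all candidates and emits one verdict at temperature 0."""
    numbered = "\n\n".join(
        f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates)
    )
    prompt = (
        "You are reviewing several candidate judgments of the same item.\n"
        "Cross-compare them against the original question, then end with "
        "'Verdict: A' or 'Verdict: B'.\n\n"
        f"Question:\n{question}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
        f"{numbered}\n"
    )
    return parse_verdict(llm(prompt, 0.0))


def rtlc_judge(llm: LLM, question: str, a: str, b: str, n: int = 10) -> str:
    """Run the three stages and return 'A', 'B', or '' if no label was parsed."""
    prompt = build_scaffold_prompt(question, a, b)
    candidates = sample_candidates(llm, prompt, n=n, temperature=0.4)
    return critique(llm, question, a, b, candidates)
```

In this sketch the final verdict comes from a separate critique call that reads all candidates, rather than from a majority vote over them, which is the distinction the abstract draws between RTLC and plain self-consistency voting. Any real model client would need to be adapted to the assumed `(prompt, temperature)` signature.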
