Efficient Online RFT with Plug-and-Play LLM Judges: Unlocking State-of-the-Art Performance
Rudransh Agnihotri, Ananya Pandey

TL;DR
This paper introduces a cost-effective, plug-and-play LLM-based judge that replaces heavyweight models in RLHF, achieving state-of-the-art performance and high interpretability with minimal additional parameters.
Contribution
It presents a novel method using a frozen instruction-tuned 7B LLM with a tiny LoRA adapter as an effective, transparent reward model for RLHF, eliminating the offline tuning phase.
Findings
Achieves 96.2% accuracy on RewardBench, outperforming larger reward networks.
Enables a 7B actor to surpass a 70B baseline in GSM-8K accuracy.
The LoRA judge's explanations score 9/10 for similarity to human explanations, as rated by GPT-4.
Abstract
Reward-model training is the cost bottleneck in modern Reinforcement Learning from Human Feedback (RLHF) pipelines, often requiring tens of billions of parameters and an offline preference-tuning phase. In the proposed method, a frozen, instruction-tuned 7B LLM is augmented with only a one-line JSON rubric and a rank-16 LoRA adapter (affecting just 0.8% of the model's parameters), enabling it to serve as a complete substitute for the heavyweight evaluation models used previously. The plug-and-play judge achieves 96.2% accuracy on RewardBench, outperforming specialized reward networks ranging from 27B to 70B parameters. Additionally, with online PPO it lets a 7B actor reach 92% exact-match accuracy on GSM-8K, outperforming the top 70B DPO baseline, which scores 61.8%. Thorough ablations indicate that (i) six in-context demonstrations deliver the majority of the zero-to-few-shot…
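To give a feel for why a rank-16 adapter touches well under one percent of a 7B model, the sketch below counts the parameters LoRA adds. The counting rule itself is standard: a rank-r update B·A on a d_out × d_in weight adds r·(d_in + d_out) parameters. Everything else is an assumption for illustration: the layer dimensions, the base parameter count, and the choice to adapt every attention and MLP projection are hypothetical and are not taken from the paper, so the result is not expected to reproduce the reported 0.8% exactly.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d_out x d_in weight.

    LoRA learns an update B @ A with A of shape (rank, d_in) and
    B of shape (d_out, rank), so it adds rank * (d_in + d_out) parameters.
    """
    return rank * (d_in + d_out)


def lora_fraction(layer_shapes, num_layers, rank, base_params):
    """Fraction of base_params that the adapters add across all layers."""
    per_layer = sum(
        lora_param_count(d_in, d_out, rank) for d_in, d_out in layer_shapes
    )
    return num_layers * per_layer / base_params


# Hypothetical 7B-class dimensions (assumed for illustration, not from the paper).
HIDDEN, MLP, LAYERS, BASE = 4096, 11008, 32, 6_700_000_000

# Assumed adapted matrices per layer, as (d_in, d_out):
# four attention projections, two MLP up-projections, one MLP down-projection.
shapes = [(HIDDEN, HIDDEN)] * 4 + [(HIDDEN, MLP)] * 2 + [(MLP, HIDDEN)]

if __name__ == "__main__":
    frac = lora_fraction(shapes, LAYERS, rank=16, base_params=BASE)
    print(f"rank-16 adapters add {frac:.2%} of the base parameters")
```

The exact share depends on which matrices are adapted and on the base model's dimensions, which this summary does not specify.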
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Emotion and Mood Recognition · Adversarial Robustness in Machine Learning
