Reinforcement Learning from Meta-Evaluation: Aligning Language Models Without Ground-Truth Labels
Micah Rentschler, Jesse Roberts

TL;DR
This paper presents RLME, a reinforcement learning approach that trains large language models using an evaluator's answers to natural-language meta-questions as the reward signal, removing the need for ground-truth labels.
Contribution
RLME introduces a label-free reinforcement learning method for LLMs that uses meta-evaluation rewards, enabling scalable and controllable training without explicit ground-truth labels.
Findings
Achieves accuracy and sample efficiency comparable to label-based training
Enables training without ground-truth labels in open domains
Allows controllable trade-offs among multiple objectives
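One way to picture the multi-objective trade-off listed above is a weighted average of the evaluator's positive-judgment probabilities, one per meta-question. The sketch below is an illustration only: the summary does not say how RLME combines objectives, so the weighted-average rule, the function name `combined_reward`, and the example question keys are all assumptions.

```python
from typing import Dict


def combined_reward(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    """Blend per-question evaluator probabilities into one scalar reward.

    ASSUMPTION: a normalized weighted average. The source does not specify
    the combination rule; this is just one plausible choice.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Each probability is the evaluator's P(positive judgment) for one
    # meta-question; the weights set the trade-off between objectives.
    return sum(weights[q] * probs[q] for q in weights) / total


# Hypothetical example: favor correctness 3:1 over conciseness.
example = combined_reward(
    {"correct": 0.9, "concise": 0.5},
    {"correct": 3.0, "concise": 1.0},
)
```

Under this assumed rule, changing the weights shifts which objective dominates the reward without touching the evaluator or the meta-questions themselves.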
Abstract
Most reinforcement learning (RL) methods for training large language models (LLMs) require ground-truth labels or task-specific verifiers, limiting scalability when correctness is ambiguous or expensive to obtain. We introduce Reinforcement Learning from Meta-Evaluation (RLME), which optimizes a generator using reward derived from an evaluator's answers to natural-language meta-questions (e.g., "Is the answer correct?" or "Is the reasoning logically consistent?"). RLME treats the evaluator's probability of a positive judgment as a reward and updates the generator via group-relative policy optimization, enabling learning without labels. Across a suite of experiments, we show that RLME achieves accuracy and sample efficiency comparable to label-based training, enables controllable trade-offs among multiple objectives, steers models toward reliable reasoning patterns rather than post-hoc…
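The reward and update signal described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the evaluator exposes logits for a "yes" and a "no" answer token, and it uses the commonly described group-relative normalization (reward minus group mean, divided by group standard deviation). The function names and the `eps` parameter are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def positive_probability(logit_yes: float, logit_no: float) -> float:
    """P(positive judgment) from two answer-token logits (assumed interface)."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    ASSUMPTION: population standard deviation and a small eps to avoid
    division by zero; the source does not give these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical evaluator logits for four sampled responses to one prompt.
logits = [(2.0, 0.0), (0.0, 0.0), (-1.0, 1.0), (1.0, 0.5)]
rewards = [positive_probability(y, n) for y, n in logits]
advantages = group_relative_advantages(rewards)
```

No label appears anywhere in this computation: responses the evaluator judges more favorably than their group average receive positive advantages, and the rest receive negative ones.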
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
