EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Shuyue Stella Li, Rui Xin, Teng Xiao, Yike Wang, Rulin Shao, Zoey Hao, Melanie Sclar, Sewoong Oh, Faeze Brahman, Pang Wei Koh, Yulia Tsvetkov

TL;DR
EvoLM is a self-evolving training method for language models that uses internally generated discriminative rubrics as reward signals, eliminating the need for external supervision.
Contribution
The paper presents an approach in which a single language model co-trains a rubric generator and a policy, enabling self-improvement from its own evaluative capacity alone.
Findings
EvoLM-trained Qwen3-8B outperforms GPT-4.1 on RewardBench-2 by 25.7%.
The policy achieves 69.3% on OLMo3-Adapt, surpassing models trained with external rubrics.
Self-supervised rubrics enable significant performance gains without human annotations.
Abstract
Language models encode substantial evaluative knowledge from pretraining, yet current post-training methods rely on external supervision (human annotations, proprietary models, or scalar reward models) to produce reward signals. Each imposes a ceiling: human judgment cannot supervise capabilities beyond its own, proprietary APIs create dependencies, and verifiable rewards cover only domains with ground-truth answers. Self-improvement from a model's own evaluative capacity is a reward source that scales with the model itself, yet remains largely untapped by current methods. We introduce EvoLM, a post-training method that structures this capacity into explicit discriminative rubrics and uses them as a training signal. EvoLM trains two capabilities within a single language model in alternation: (1) a rubric generator producing instance-specific evaluation criteria optimized for…
