SERL: Self-Examining Reinforcement Learning on Open-Domain

Weixuan Ou; Yanzhao Zheng; Shuoshuo Sun; Wei Zhang; Baohua Dong; Hangcheng Zhu; Ruohui Huang; Gang Yu; Pengwei Yan; Yifan Qiao

arXiv:2511.07922 · cs.LG · February 26, 2026

Open Access

TL;DR

SERL introduces a self-improving reinforcement learning framework where large language models act as both the agent and evaluator, using internal reward mechanisms to enhance open-domain task performance without external signals.

Contribution

This paper presents SERL, a novel self-examining RL framework that eliminates the need for external rewards by deriving internal rewards from pairwise comparisons and self-consistency, advancing open-domain LLM capabilities.

Findings

01 SERL outperforms existing self-improvement methods on AlpacaEval 2.

02 SERL achieves state-of-the-art results among self-improving approaches.

03 SERL's performance is comparable to larger models like Qwen3-32B.

Abstract

Reinforcement Learning (RL) has been shown to improve the capabilities of large language models (LLMs). However, applying RL to open-domain tasks faces two key challenges: (1) the inherent subjectivity of these tasks rules out the verifiable rewards required by Reinforcement Learning with Verifiable Rewards (RLVR); (2) Reinforcement Learning from Human Feedback (RLHF) relies on external reward mechanisms. To overcome these limitations, we propose Self-Examining Reinforcement Learning (SERL), a novel self-improving framework in which the LLM serves as both Actor and Judge. SERL introduces two synergistic reward mechanisms that use no external signals. First, to improve the Actor's capability, we derive rewards from Copeland-style pairwise comparison judgments across a group of generated responses. Second, a self-consistency reward that encourages coherent judgments is…
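To make the Copeland-style idea concrete, here is a minimal sketch of how pairwise judgments over a group of responses can be turned into one score per response. It uses the textbook Copeland rule (wins minus losses, scaled by the number of opponents); the exact reward formula used in SERL is not given in this summary, so treat this as an illustration only. The names `copeland_rewards`, `judge`, and `longer_is_better` are hypothetical, and the toy length-based judge merely stands in for the LLM acting as Judge.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def copeland_rewards(
    responses: Sequence[str],
    judge: Callable[[str, str], int],
) -> List[float]:
    """Return one Copeland-style score per response, scaled to [-1, 1].

    `judge(a, b)` is a hypothetical comparison function: it returns 1 if `a`
    is preferred, -1 if `b` is preferred, and 0 for a tie.
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n
    scores = [0] * n
    # Compare every unordered pair once; a win for one side is a loss for the other.
    for i, j in combinations(range(n), 2):
        outcome = judge(responses[i], responses[j])
        scores[i] += outcome
        scores[j] -= outcome
    # Each response meets n - 1 opponents, so dividing bounds the score by 1.
    return [s / (n - 1) for s in scores]


def longer_is_better(a: str, b: str) -> int:
    """Toy stand-in judge (an assumption for the demo): prefer the longer text."""
    return (len(a) > len(b)) - (len(a) < len(b))


group = ["ok", "a fuller answer", "a much more detailed answer", "hm"]
rewards = copeland_rewards(group, longer_is_better)
```

With this toy judge, the longest response beats every other one and receives the maximum score of 1.0, while the two equally short responses tie with each other and share the lowest score. The scores always sum to zero across the group, which makes them usable as group-relative signals without a separate baseline.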

Taxonomy

Topics: Topic Modeling · Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning