Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Yuan Sui; Bryan Hooi

arXiv:2601.21464 · cs.CL · May 8, 2026

TL;DR

The paper introduces CoNL, a self-evolving framework for LLMs that uses multi-agent self-play to improve evaluation and generation capabilities without external ground truth.

Contribution

The paper presents a meta-evaluation framework in which LLMs improve themselves through structured multi-agent conversations and critique-based training.

Findings

01 CoNL outperforms self-rewarding baselines on various benchmarks.

02 The framework keeps training stable while improving both evaluation and generation.

03 Meta-evaluation improves LLM performance without external labels.

Abstract

Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in…
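The key insight above, that a critique can be judged by whether it helps its recipient improve, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the record layout, and the use of a raw score difference as the reward are all assumptions made for illustration, and the abstract does not say how solution scores are obtained, so the scores here are placeholder numbers.

```python
from collections import defaultdict
from statistics import mean


def critique_reward(score_before: float, score_after: float) -> float:
    # Assumption: the reward is the raw change in the recipient's solution score.
    # The source does not specify the actual reward or how scores are computed.
    return score_after - score_before


def mean_reward_per_critic(records):
    # records: iterable of (critic_id, score_before, score_after) tuples,
    # a hypothetical layout chosen for this sketch.
    rewards = defaultdict(list)
    for critic_id, before, after in records:
        rewards[critic_id].append(critique_reward(before, after))
    return {critic: mean(values) for critic, values in rewards.items()}


# Placeholder scores: agent_a's critiques helped, agent_b's critique hurt.
records = [
    ("agent_a", 0.5, 0.75),
    ("agent_a", 0.25, 0.5),
    ("agent_b", 0.5, 0.25),
]
rewards = mean_reward_per_critic(records)
print(rewards)
```

Under this assumed scheme, a critic whose feedback leads to better revised solutions gets a positive average reward, and one whose feedback makes solutions worse gets a negative one, so no external label is needed to rank critiques against each other.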
