ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking

Qiang Zhang; Boli Chen; Fanrui Zhang; Ruixue Ding; Shihang Wang; Qiuchen Wang; Yinfeng Huang; Haonan Zhang; Rongxiang Zhu; Pengyong Wang; Ailin Ren; Xin Li; Pengjun Xie; Jiawei Liu; Ning Guo; Jingren Zhou; Zheng-Jun Zha

arXiv:2601.06487·cs.LG·January 23, 2026

TL;DR

ArenaRL replaces pointwise scalar rewards with tournament-based relative ranking within each group of sampled trajectories. This addresses the reward model's difficulty in telling similar trajectories apart and yields more robust solutions on complex open-ended agent tasks.

Contribution

The paper proposes ArenaRL, a novel RL paradigm using intra-group relative ranking and tournament schemes to enhance training stability and effectiveness for open-ended tasks.

Findings

1. ArenaRL's tournament ranking comes close to the accuracy of full pairwise comparison while needing only a linear number of comparisons.

2. It outperforms standard RL baselines on new open-ended benchmarks.

3. The approach improves robustness and solution quality for complex real-world tasks.
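
To make the complexity claim in the first finding concrete, the sketch below counts judge calls for a full round-robin versus a bracket. It assumes a plain single-elimination bracket, which is only one possible tournament scheme and is not confirmed here as the paper's exact design. The names `round_robin_comparisons`, `single_elimination`, and the `beats` judge are hypothetical.

```python
from itertools import combinations


def round_robin_comparisons(n: int) -> int:
    """Count judge calls when every pair in a group of n is compared: n*(n-1)/2."""
    return sum(1 for _ in combinations(range(n), 2))


def single_elimination(players, beats):
    """Run a single-elimination bracket (an assumed scheme, for illustration).

    players: list of hashable ids.
    beats(a, b): hypothetical pairwise judge, True if a is preferred over b.
    Returns (tiers, comparisons), where tiers maps each player to the round
    in which it was eliminated; the champion receives the highest value.
    """
    alive = list(players)
    tiers = {}
    comparisons = 0
    round_no = 0
    while len(alive) > 1:
        next_round = []
        # Pair adjacent players; with an odd count the last one gets a bye.
        for i in range(0, len(alive) - 1, 2):
            a, b = alive[i], alive[i + 1]
            comparisons += 1
            winner, loser = (a, b) if beats(a, b) else (b, a)
            tiers[loser] = round_no
            next_round.append(winner)
        if len(alive) % 2 == 1:
            next_round.append(alive[-1])
        alive = next_round
        round_no += 1
    tiers[alive[0]] = round_no
    return tiers, comparisons


if __name__ == "__main__":
    group = list(range(8))
    # Toy judge: the larger id is always preferred.
    tiers, used = single_elimination(group, lambda a, b: a > b)
    print("bracket comparisons:", used)
    print("round-robin comparisons:", round_robin_comparisons(len(group)))
    print("elimination tiers:", tiers)
```

Each bracket match eliminates exactly one player, so a group of n needs n - 1 comparisons, against n(n - 1)/2 for the round-robin. The bracket yields coarse tiers, not a full ordering, which is why its accuracy relative to full pairwise comparison is a question worth measuring.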

Abstract

Reinforcement learning has substantially improved the performance of LLM agents on tasks with verifiable outcomes, but it still struggles on open-ended agent tasks with vast solution spaces (e.g., complex travel planning). Due to the absence of objective ground-truth for these tasks, current RL algorithms largely rely on reward models that assign scalar scores to individual responses. We contend that such pointwise scoring suffers from an inherent discrimination collapse: the reward model struggles to distinguish subtle advantages among different trajectories, resulting in scores within a group being compressed into a narrow range. Consequently, the effective reward signal becomes dominated by noise from the reward model, leading to optimization stagnation. To address this, we propose ArenaRL, a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group…
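
The discrimination-collapse argument in the abstract can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the paper's experiment: the reward model is modeled as true quality plus Gaussian noise, and advantages are obtained by standardizing scores within a group (a common group-relative choice that is assumed here). All function names and parameters are hypothetical.

```python
import random
import statistics


def standardize(scores):
    """Group-standardized advantages: (score - mean) / std within one group."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return [0.0] * len(scores)
    return [(s - mean) / std for s in scores]


def pair_agreement(true_quality, advantages):
    """Fraction of pairs ordered the same way by advantages and by true quality."""
    n = len(true_quality)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            dq = true_quality[i] - true_quality[j]
            da = advantages[i] - advantages[j]
            if dq * da > 0:
                agree += 1
    return agree / total


def simulate(spread, noise_std, group_size=8, trials=2000, seed=0):
    """Average pair agreement when true qualities span `spread` (assumed model)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        quality = [rng.uniform(0.0, spread) for _ in range(group_size)]
        # Assumed reward model: true quality corrupted by Gaussian noise.
        scores = [q + rng.gauss(0.0, noise_std) for q in quality]
        acc += pair_agreement(quality, standardize(scores))
    return acc / trials


if __name__ == "__main__":
    # Well-separated trajectories vs. trajectories compressed into a narrow range.
    print("wide spread:  ", simulate(spread=1.0, noise_std=0.05))
    print("narrow spread:", simulate(spread=0.05, noise_std=0.05))
```

With the same noise level, the narrow-spread group is ordered much closer to chance than the wide-spread group. Standardization rescales the scores but cannot recover the ordering, so the resulting advantages mostly reflect reward-model noise.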

Taxonomy

Topics: Reinforcement Learning in Robotics · Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications