RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models
Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang

TL;DR
RLAR introduces a dynamic, agent-driven reward system for multi-task reinforcement learning with large language models, improving generalization and performance across diverse tasks by autonomously synthesizing and updating reward functions.
Contribution
It proposes RLAR, a novel framework that enables LLM agents to autonomously retrieve and synthesize reward models, addressing static reward limitations in RL training.
Findings
RLAR improves performance by 10-60% across multiple tasks.
It outperforms static reward baselines on RewardBench-V2.
RLAR approaches the performance upper bound, demonstrating strong generalization.
Abstract
Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across…
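The per-query reward assignment described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `RewardTool`, `RewardRouter`, `exact_number_verifier`, and `length_proxy_reward_model` are hypothetical, the task label is assumed to be given rather than inferred by an agent, and the "retrieved reward model" is replaced by a trivial stand-in function so the example runs without external dependencies.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# A reward tool maps (query, response) to a scalar score.
RewardFn = Callable[[str, str], float]


@dataclass
class RewardTool:
    name: str
    kind: str  # "rule" for programmatic verifiers, "model" for reward models
    score: RewardFn


def exact_number_verifier(reference: str) -> RewardFn:
    """Hypothetical rule-based verifier: 1.0 if the last number in the response equals the reference."""
    def score(query: str, response: str) -> float:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
        return 1.0 if numbers and numbers[-1] == reference else 0.0
    return score


def length_proxy_reward_model(query: str, response: str) -> float:
    """Stand-in for a retrieved reward model (assumption): favors non-empty, concise answers."""
    words = len(response.split())
    return 0.0 if words == 0 else min(1.0, 20.0 / words)


class RewardRouter:
    """Picks a reward tool per task label and falls back to a general reward model."""

    def __init__(self, fallback: RewardTool) -> None:
        self.fallback = fallback
        self.by_task: Dict[str, RewardTool] = {}

    def register(self, task: str, tool: RewardTool) -> None:
        self.by_task[task] = tool

    def select(self, task: str) -> RewardTool:
        return self.by_task.get(task, self.fallback)

    def reward(self, task: str, query: str, response: str) -> float:
        return self.select(task).score(query, response)


# Usage: a verifiable task gets a rule-based check, everything else the fallback model.
router = RewardRouter(RewardTool("proxy_rm", "model", length_proxy_reward_model))
router.register("math", RewardTool("exact_number", "rule", exact_number_verifier("42")))

math_reward = router.reward("math", "What is 6 * 7?", "6 * 7 = 42")
chat_reward = router.reward("chat", "Say hi", "Hello there!")
```

In this sketch new tools are registered by hand. In the system the abstract describes, that step would be performed by agents that retrieve reward models or generate verifier code during training.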
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
- A tool that automates the provisioning of reward signals for any specified task could significantly accelerate research by abstracting away the challenging reward-design component of LLM training. Furthermore, such a tool could serve as a standard benchmark, allowing various approaches to be meaningfully compared using a common reward signal.
- The idea of using both web and code agents to generate the reward signal is compelling, as it offers the potential to cover a broad range of tasks by h
- The paper's overall quality and presentation are subpar. It suffers from numerous typos and incomplete sentences. Furthermore, key sections are underdeveloped (e.g., code agents), while critical details about core mechanics are omitted (e.g., the web agent's ranking, filtering, and semantic similarity computation). These omissions create significant gaps in understanding. Finally, the limitations section exceeds the page limit.
- The empirical evaluation is narrow and insufficient. Given the a
- The paper is well-organized and clearly articulates the problem of reward model generalization.
- It proposes a novel and promising conceptual solution to a significant bottleneck in RL alignment: the high cost and catastrophic forgetting associated with training numerous task-specific reward models.
- The concept of an "agentic" system that dynamically selects reward functions is nontrivial and makes a novel contribution.
- The engineering part may also contribute to the community.
+ The experiments are limited to small-scale models. While this serves as a good proof-of-concept, the paper would be much stronger if the findings were validated on larger models.
+ Also, to demonstrate true scalability and practical applicability, the current tasks and datasets need to be expanded.
+ Some analysis of efficiency would be appreciated.
1. Analysis shows RLAR consistently generates and deploys high-quality, task-aligned rewards from diverse sources, with code agents creating executable tools in 94.9% of cases and web agents integrating 47.6% of retrieved repositories.
2. During training, RLAR adaptively selects from a portfolio of evaluators, with LLM-based reward models used in 96.4% of cases and task-specific rule-based checks applied when appropriate. This matching of evaluators to domain characteristics produces smoother,
1. The paper could benefit from a more detailed literature review and explicit comparisons with other dynamic reward selection or adaptive RL alignment methods to position RLVR/RLAR within existing research.
2. The criteria for reward tools are unclear to me. The paper should explain the evaluation metrics or qualitative measures used to judge tool quality, possibly including human or benchmark-based assessments, error analysis, and robustness checks.
1. Idea: Clear, practical framing of “tool-ized” reward design that orchestrates rule/metric tools with retrieved reward models, addressing known issues of distribution shift and monolithic reward models in heterogeneous training.
2. Efficiency: Reports substantially lower time and token cost than purely generative RMs used as judges, which is valuable for scaling RL post-training.
3. Breadth: Evaluates across several task categories and includes some out-of-domain generalization checks (e.g.,
1. Benchmark currency and coverage: MT-Bench is aging; more recent reasoning/agentic evals like AlpacaEval 2 or Arena-Hard2 are omitted, and broader, widely used public leaderboards for instruction-following/reasoning are missing, making claims on generalization and reasoning less persuasive.
2. Baseline sufficiency on verifiable tasks: For math and other verifiable domains, recent RLVR-style baselines and strong verifiable-reward pipelines are not included; without these, it is hard to isolate the
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
