Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling

Derek Li; Jiaming Zhou; Leo Maxime Brunswic; Abbas Ghaddar; Qianyi Sun; Liheng Ma; Yu Luo; Dong Li; Mark Coates; Jianye Hao; Yingxue Zhang

arXiv:2507.14783·cs.LG·September 30, 2025

Open Access · 3 Reviews

TL;DR

Omni-Thinker introduces a unified RL framework that enhances multi-task learning in LLMs by combining hybrid rewards with task scheduling based on backward transfer, leading to improved performance across diverse domains.

Contribution

The paper presents an RL framework that combines hybrid rewards with scheduling guided by backward transfer (BWT) to scale LLMs across multiple tasks, reducing forgetting and improving multi-task performance.
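The hybrid-reward idea can be sketched roughly as follows, based only on the abstract's description of rule-based verifiable signals combined with LLM-as-a-Judge preferences. This is a minimal illustration, not the authors' implementation: the `Sample` type, the exact-match rule, the `judge` callable, and the clipping to [0, 1] are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    prompt: str
    response: str
    # Ground-truth answer; None marks an open-ended task (assumption).
    reference: Optional[str] = None


def rule_based_reward(sample: Sample) -> float:
    """Verifiable signal: 1.0 on a normalized exact match, else 0.0."""
    predicted = sample.response.strip().lower()
    expected = sample.reference.strip().lower()
    return 1.0 if predicted == expected else 0.0


def hybrid_reward(sample: Sample, judge: Callable[[str, str], float]) -> float:
    """Use the rule-based check when a reference exists, else ask the judge.

    `judge` is a hypothetical stand-in for an LLM-as-a-Judge call that
    returns a preference score; the score is clipped to [0, 1] here.
    """
    if sample.reference is not None:
        return rule_based_reward(sample)
    score = judge(sample.prompt, sample.response)
    return min(1.0, max(0.0, score))
```

Routing on whether a reference answer exists keeps deterministic tasks on a cheap exact check and reserves the judge for subjective outputs; other routing rules are equally plausible.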

Findings

01 · Achieved a 6.2% gain over joint training

02 · Achieved a 12.4% gain over model merging

03 · Accurate prediction of curriculum outcomes using BWT assumptions

Abstract

The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present Omni-Thinker, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics…
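To make BWT-guided ordering concrete, here is a small sketch under stated assumptions. It treats `bwt[x][y]` as the measured change in accuracy on task `y` after training on task `x`, and scores an ordering by summing the effect of each later task on every task trained before it. The brute-force search, the scoring rule, the task names, and the numbers are all hypothetical illustrations, not the scheduler or results from the paper.

```python
from itertools import permutations


def order_score(order, bwt):
    """Sum the transfer from each later-trained task onto earlier ones.

    bwt[later][earlier] is the accuracy change on `earlier` caused by
    training on `later`; negative values indicate forgetting.
    """
    return sum(
        bwt[later][earlier]
        for pos, later in enumerate(order)
        for earlier in order[:pos]
    )


def best_order(tasks, bwt):
    """Exhaustively pick the ordering with the highest total score.

    Feasible only for a handful of tasks, since it checks n! orderings.
    """
    return max(permutations(tasks), key=lambda o: order_score(o, bwt))


# Made-up pairwise transfer values for three placeholder tasks.
example_bwt = {
    "task_a": {"task_b": -0.05, "task_c": -0.02},
    "task_b": {"task_a": 0.03, "task_c": -0.01},
    "task_c": {"task_a": 0.02, "task_b": 0.04},
}

schedule = best_order(["task_a", "task_b", "task_c"], example_bwt)
```

With these placeholder values, the task that harms the others most when trained late (`task_a`) is scheduled first, and the task that helps the others (`task_c`) comes last.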

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

* This paper proposes a simple framework for ordering tasks. The final ordering heuristics make intuitive sense.

Weaknesses

* This work relies on two overly simplistic assumptions, and there is insufficient evidence to justify either of them. Also see the questions section.
* The authors claim that the accuracy predicted from test-set backward transfers is surprisingly precise; however, Table 3 shows relatively low correlations between test and predicted accuracies.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This framework addresses the inconsistency in optimization direction across different tasks during reinforcement learning, integrating verifiable rule-based rewards and preference-based LLM evaluation into a unified reinforcement learning paradigm.
2. The proposed BWT, by quantifying how learning a task influences the performance of previously learned tasks, provides a reference for choosing the task order in curriculum learning, mitigating the catastrophic forgetting problem to

Weaknesses

1. As mentioned in the article, the overhead of curriculum scheduling grows as the workload increases, so the scalability of the proposed method may be limited. Are there efficient strategies for real-world deployment?
2. The paper presents results using Qwen2.5-7B as the base model for all experiments. Would the same backward-transfer-guided scheduling strategy remain optimal for significantly smaller or larger models?
3. The overall framework, particularly the curriculum des

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The use of backward transfer matrices to guide curriculum ordering is principled and builds on established continual learning concepts.
- The paper is well written and easy to follow.

Weaknesses

Please see my detailed questions and concerns below.

Taxonomy

Topics: Cloud Computing and Resource Management · Scheduling and Optimization Algorithms · Real-Time Systems Scheduling