Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
Derek Li, Jiaming Zhou, Leo Maxime Brunswic, Abbas Ghaddar, Qianyi Sun, Liheng Ma, Yu Luo, Dong Li, Mark Coates, Jianye Hao, Yingxue Zhang

TL;DR
Omni-Thinker introduces a unified RL framework that enhances multi-task learning in LLMs by combining hybrid rewards with task scheduling based on backward transfer, leading to improved performance across diverse domains.
Contribution
The paper presents an RL framework that combines hybrid rewards with backward-transfer (BWT)-guided task scheduling to scale LLMs across diverse tasks, reducing forgetting and improving multi-task performance.
Findings
6.2% gain over joint training across four domains
12.4% gain over model merging across the same four domains
Simple assumptions on accuracy transfer (BWT) yield accurate predictions of curriculum outcomes
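As a rough illustration of how a backward-transfer matrix could be used to compare task orderings, the sketch below scores each ordering by summing the transfer that every later task exerts on the tasks trained before it, then picks the best ordering by brute force. The additive scoring rule, the function names `order_score` and `best_order`, the task names, and all numbers are assumptions made for illustration; they are not taken from the paper and may differ from its actual assumptions and scheduler.

```python
from itertools import permutations


def order_score(order, bwt):
    """Predicted net transfer of a curriculum under an additive assumption.

    bwt[later][earlier] is the accuracy change on `earlier` caused by
    training on `later` afterwards (negative values mean forgetting).
    """
    total = 0.0
    for position, later in enumerate(order):
        for earlier in order[:position]:
            total += bwt[later][earlier]
    return total


def best_order(tasks, bwt):
    """Brute-force the ordering with the highest predicted net transfer."""
    return max(permutations(tasks), key=lambda order: order_score(order, bwt))


# Hypothetical BWT values for four placeholder tasks (not results from the paper).
bwt = {
    "math": {"code": 0.02, "qa": -0.01, "writing": -0.04},
    "code": {"math": 0.01, "qa": -0.02, "writing": -0.05},
    "qa": {"math": -0.01, "code": 0.00, "writing": 0.01},
    "writing": {"math": -0.06, "code": -0.03, "qa": 0.02},
}
tasks = list(bwt)

chosen = best_order(tasks, bwt)
print(chosen, round(order_score(chosen, bwt), 3))
```

Brute force is only practical for a handful of tasks (four tasks give 24 orderings); a larger task set would need a greedy or heuristic search instead.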
Abstract
The pursuit of general-purpose artificial intelligence depends on large language models (LLMs) that can handle both structured reasoning and open-ended generation. We present Omni-Thinker, a unified reinforcement learning (RL) framework that scales LLMs across diverse tasks by combining hybrid rewards with backward-transfer-guided scheduling. Hybrid rewards integrate rule-based verifiable signals with preference-based evaluations from an LLM-as-a-Judge, enabling learning in both deterministic and subjective domains. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance. Experiments across four domains show gains of 6.2% over joint training and 12.4% over model merging. Moreover, we demonstrate that simple assumptions on accuracy transfer yield accurate predictions of curriculum outcomes, with entropy dynamics…
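The hybrid-reward idea described in the abstract can be sketched as a simple dispatch: tasks with a verifiable reference answer get a rule-based reward, and open-ended tasks defer to a judge. This is a minimal sketch under stated assumptions: the function `hybrid_reward`, the exact-match rule, the clamping to [0, 1], and the stub judge `length_judge` are all hypothetical stand-ins, not the paper's reward definitions or its LLM-as-a-Judge setup.

```python
from typing import Callable, Optional


def hybrid_reward(
    response: str,
    reference: Optional[str],
    judge: Callable[[str], float],
) -> float:
    """Return a reward in [0, 1] for one response.

    If a reference answer exists, apply a rule-based exact-match check
    (an assumed rule). Otherwise fall back to a judge score, clamped to [0, 1].
    """
    if reference is not None:
        return 1.0 if response.strip() == reference.strip() else 0.0
    return min(1.0, max(0.0, judge(response)))


def length_judge(response: str) -> float:
    """Hypothetical stand-in for an LLM judge: longer answers score higher."""
    return len(response.split()) / 10.0


# Verifiable task: the reference answer decides the reward.
print(hybrid_reward(" 42 ", "42", length_judge))
# Open-ended task: no reference, so the judge decides.
print(hybrid_reward("a short story about rain", None, length_judge))
```

In a real pipeline the judge callable would wrap a call to a judge model; keeping it as a plain function argument makes the dispatch logic testable without one.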
Peer Reviews
Decision · Submitted to ICLR 2026
* This paper proposes a simple framework for ordering tasks. The final ordering heuristics make intuitive sense.
* This work relies on overly simplistic assumptions (both of them), and there is insufficient evidence to justify them. Also see the questions section.
* The authors claim that the accuracy predicted from test-set backward transfers is surprisingly precise; however, Table 3 shows relatively low correlations between test and predicted accuracies.
1. This framework addresses the inconsistency in optimization direction across different tasks in the reinforcement learning process, integrating verifiable rule-based rewards and preference-based LLM evaluation into a unified reinforcement learning paradigm.
2. The proposed BWT, by quantifying how learning a task influences the performance of previously learned tasks, provides a reference for the learning order in curriculum learning, mitigating the catastrophic forgetting problem to
1. As mentioned in the article, the overhead of curriculum scheduling increases gradually with the increase in workload, and the scalability of the proposed method may be limited. Are there efficient strategies for real-world deployment?
2. The paper presents results using Qwen2.5-7B as the base model for all experiments. Would the same backward-transfer-guided scheduling strategy remain optimal for significantly smaller or larger models?
3. The overall framework, particularly the curriculum des
- The use of backward transfer matrices to guide curriculum ordering is principled and builds on established continual learning concepts.
- The paper is well written and easy to follow.
Please see my detailed questions and concerns below.
Taxonomy
Topics: Cloud Computing and Resource Management · Scheduling and Optimization Algorithms · Real-Time Systems Scheduling
