MARLaaS: Multi-Tenant Asynchronous Reinforcement Learning as a Service
Timothy Tin Long Yu, Gursimran Singh, Ge Shi, Hanieh Sadri, Yong Zhang, Zhenan Fan

TL;DR
MARLaaS introduces a multi-tenant asynchronous system for efficient reinforcement learning fine-tuning of large language models, enabling concurrent multi-task training with significant speedups and resource utilization improvements.
Contribution
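MARLaaS presents a novel architecture that shares one base model across tenants through lightweight LoRA adapters and splits rollout generation, environment interaction, and policy training into asynchronous stages for scalable multi-task RL fine-tuning.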
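The shared-model idea can be illustrated with a minimal sketch, assuming the lightweight adapters are standard LoRA-style low-rank updates (the abstract names LoRA but this page gives no implementation details). The class `SharedBaseLayer`, its methods, and the tenant names are hypothetical and written in pure Python for illustration; the usual LoRA scaling factor is omitted for brevity.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


class SharedBaseLayer:
    """Hypothetical layer: one frozen weight W shared by every tenant,
    plus a small low-rank pair (A, B) per tenant, so y = W x + B (A x)."""

    def __init__(self, d_in, d_out, rank, seed=0):
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.rng = random.Random(seed)
        # Frozen base weights, stored once no matter how many tenants exist.
        self.W = [[self.rng.gauss(0, 0.1) for _ in range(d_in)]
                  for _ in range(d_out)]
        self.adapters = {}

    def add_tenant(self, tenant_id):
        # Common LoRA initialisation: A random, B zero, so a fresh adapter
        # leaves the base model's output unchanged.
        A = [[self.rng.gauss(0, 0.1) for _ in range(self.d_in)]
             for _ in range(self.rank)]
        B = [[0.0] * self.rank for _ in range(self.d_out)]
        self.adapters[tenant_id] = (A, B)

    def forward(self, tenant_id, x):
        A, B = self.adapters[tenant_id]
        base = matvec(self.W, x)
        delta = matvec(B, matvec(A, x))
        return [b + d for b, d in zip(base, delta)]

    def adapter_params(self):
        # Trainable parameters per tenant: rank * (d_in + d_out).
        return self.rank * (self.d_in + self.d_out)


layer = SharedBaseLayer(d_in=8, d_out=8, rank=2)
layer.add_tenant("tenant_a")
layer.add_tenant("tenant_b")
x = [1.0] * 8

before_a = layer.forward("tenant_a", x)
before_b = layer.forward("tenant_b", x)

# Stand-in for one training update that touches only tenant_a's adapter.
layer.adapters["tenant_a"][1][0][0] = 1.0

after_a = layer.forward("tenant_a", x)
after_b = layer.forward("tenant_b", x)
```

In this toy setting each tenant adds 32 trainable values on top of 64 shared base weights, and updating one tenant's adapter leaves every other tenant's output untouched. That isolation is what makes sharing a single base model across tenants plausible.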
Findings
Runs up to 32 tasks concurrently with state-of-the-art performance.
Improves accelerator utilization by up to 4.3x.
Reduces end-to-end training time by 85%.
Abstract
Reinforcement Learning from Verifiable Rewards (RLVR) has significantly improved the reasoning capabilities of large language models (LLMs), particularly in multi-turn agentic settings that involve environment interaction such as tool use. However, fine-tuning such models remains prohibitively expensive due to high computational requirements, limiting accessibility. We propose MARLaaS (Multi-tenant Asynchronous RL as a Service), a system for concurrent RL fine-tuning across multiple users and tasks. Our approach is based on two key ideas: (1) sharing a base model across tenants using lightweight LoRA adapters, and (2) a disaggregated asynchronous architecture that decouples rollout generation, environment interaction, and policy training into independently scheduled stages. This design enables tasks to progress through the RL pipeline at their own pace in an event-driven manner, reducing…
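The decoupled, event-driven pipeline described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the stage names mirror the abstract, but the queue layout, the `run_pipeline` and `stage_worker` functions, the `workers_per_stage` parameter, and the sleep-based delays standing in for real work are all hypothetical.

```python
import asyncio

STAGES = ("rollout", "env", "train")


async def stage_worker(stage_idx, queues, done, num_steps, delays, log):
    """Pull events for one stage, do the (simulated) work, emit the next event."""
    name = STAGES[stage_idx]
    while True:
        task_id, step = await queues[name].get()
        await asyncio.sleep(delays[task_id])  # stand-in for real stage work
        log.append((name, task_id, step))
        if name != "train":
            # Hand the task to the next stage's queue.
            await queues[STAGES[stage_idx + 1]].put((task_id, step))
        elif step + 1 < num_steps:
            # Training finished for this step: schedule the next rollout.
            await queues["rollout"].put((task_id, step + 1))
        else:
            done.put_nowait(task_id)


async def run_pipeline(delays, num_steps, workers_per_stage=2):
    """Run every task through rollout -> env -> train for num_steps iterations."""
    queues = {name: asyncio.Queue() for name in STAGES}
    done = asyncio.Queue()
    log = []
    workers = [
        asyncio.create_task(
            stage_worker(idx, queues, done, num_steps, delays, log))
        for idx in range(len(STAGES))
        for _ in range(workers_per_stage)
    ]
    for task_id in delays:
        await queues["rollout"].put((task_id, 0))
    # Collect task ids in the order they complete.
    finished = [await done.get() for _ in delays]
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return log, finished


# Two hypothetical tenant tasks with very different per-stage latencies.
log, finished = asyncio.run(
    run_pipeline({"fast": 0.001, "slow": 0.02}, num_steps=2))
print(finished)
```

Because each stage only reacts to events on its own queue, no task waits at a global synchronization barrier. In this toy run the fast task would typically complete all of its steps while the slow task is still in an early stage.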
