ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents
Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang

TL;DR
ComputerRL introduces a scalable framework for training autonomous desktop agents using reinforcement learning, combining API and GUI interactions, with a novel training strategy to improve performance and generalization.
Contribution
The paper presents ComputerRL, a scalable distributed RL infrastructure and Entropulse training method for improving desktop agent performance.
Findings
Achieved 48.9% accuracy on the OSWorld benchmark.
Enabled large-scale online RL with thousands of virtual desktops.
Improved generalization of desktop agents through new training strategies.
Abstract
We introduce ComputerRL, a framework for autonomous desktop intelligence that enables agents to operate complex digital workspaces skillfully. ComputerRL features the API-GUI paradigm, which unifies programmatic API calls and direct GUI interaction to address the inherent mismatch between machine agents and human-centric desktop environments. Scaling end-to-end RL training is crucial for improvement and generalization across diverse desktop tasks; however, it remains challenging due to environmental inefficiency and instability during extended training. To support scalable and robust training, we develop a distributed RL infrastructure capable of orchestrating thousands of parallel virtual desktop environments to accelerate large-scale online RL. Furthermore, we propose Entropulse, a training strategy that alternates reinforcement learning with supervised fine-tuning, effectively…
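The alternation described in the abstract can be pictured as a simple phase schedule. The following is a minimal sketch only, not the authors' implementation: the function names `alternating_schedule` and `run_training`, the number of phases, the choice to start with RL, and the callables standing in for each phase are all assumptions made for illustration.

```python
from itertools import cycle, islice
from typing import Callable, List, TypeVar

State = TypeVar("State")


def alternating_schedule(num_phases: int, first: str = "rl") -> List[str]:
    """Return a list of phase names that alternates between "rl" and "sft".

    Hypothetical helper: the source only says the two are alternated; the
    starting phase and the phase count are assumptions.
    """
    order = ("rl", "sft") if first == "rl" else ("sft", "rl")
    return list(islice(cycle(order), num_phases))


def run_training(
    num_phases: int,
    rl_phase: Callable[[State], State],
    sft_phase: Callable[[State], State],
    state: State,
) -> State:
    """Apply the RL and SFT phase callables in alternating order.

    `rl_phase` and `sft_phase` are placeholders for whatever a real
    training stack would do in each phase; they are not defined by the source.
    """
    for phase in alternating_schedule(num_phases):
        state = rl_phase(state) if phase == "rl" else sft_phase(state)
    return state


if __name__ == "__main__":
    # Toy usage: record which phase ran, in order.
    log = run_training(
        4,
        rl_phase=lambda s: s + ["rl"],
        sft_phase=lambda s: s + ["sft"],
        state=[],
    )
    print(log)
```

Passing the phases as callables keeps the schedule separate from the training code, so the same loop can be reused with different phase lengths or data sources.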
Peer Reviews
Decision·ICLR 2026 Poster
- The paper is overall well written.
- Novel API-GUI paradigm for more generality in computer-based tasks.
- New Entropulse training strategy that mitigates entropy collapse by alternating between supervised learning and RL.
- Scalable and asynchronous RL training pipeline.
- Strong empirical results.
Overall, the contribution lies more in engineering execution than in theoretical advancement.
- Limited algorithmic novelty: primarily builds upon GRPO, and alternating between SFT and RL is similar to exploration-refresh or replay strategies.
- The paper does not fully disentangle how much gain comes from ComputerRL’s methods compared to having strong pre-trained models.
- Limited experimental diversity: they mostly evaluate on OSWorld and OfficeWorld, with no long-horizon or multi-user adaptat
- **Strong systems and engineering contribution:** The distributed RL infrastructure is technically impressive, enabling large-scale online RL across thousands of virtualized desktop environments. Such scale is rare in open research and represents a substantial engineering achievement.
- **Practical API-GUI paradigm:** The unified action space combining GUI operations with automatically constructed APIs addresses a key bottleneck in desktop automation. The LLM-driven API construction pipeline
- **Unsubstantiated claims about diversity and exploration:** The paper claims that alternating SFT with RL increases exploration and diversity, yet no quantitative evidence is provided. Metrics such as action entropy, trajectory variance, or coverage are not analyzed. The only evidence is a qualitative entropy curve, which is insufficient.
- **Incomplete empirical rigor and reproducibility:** All training curves appear to represent single runs without confidence intervals or variance estimates
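To make the "action entropy" metric mentioned in this review concrete, here is a minimal sketch of how mean policy entropy could be computed from per-step action logits. It is an illustration under assumptions, not something taken from the paper: the function names, the use of the natural logarithm, and the representation of a trajectory as a list of logit vectors are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_policy_entropy(step_logits: Sequence[Sequence[float]]) -> float:
    """Average the per-step entropy over a trajectory of logit vectors.

    Assumes (hypothetically) that each step exposes a logit vector over
    a discrete set of candidate actions or tokens.
    """
    return sum(entropy(softmax(l)) for l in step_logits) / len(step_logits)


if __name__ == "__main__":
    uniform_steps = [[0.0, 0.0], [0.0, 0.0]]    # maximally uncertain policy
    peaked_steps = [[10.0, -10.0], [10.0, -10.0]]  # nearly deterministic policy
    print(mean_policy_entropy(uniform_steps))
    print(mean_policy_entropy(peaked_steps))
```

A value that falls toward zero over training would be consistent with the entropy collapse the reviewers discuss, while reporting it across several seeds would address the variance concern.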
The API-GUI unification is an interesting engineering contribution that bridges the gap between human-designed interfaces and agent-level programmatic control. The distributed RL infrastructure is impressive in scale and demonstrates strong engineering capability, enabling parallelized desktop environments at large scale. The Entropulse idea addresses an important issue in long-horizon RL (entropy collapse), and the empirical results suggest measurable benefits in maintaining exploration and t
The paper’s novelty lies primarily in implementation and scaling, not in new algorithmic contributions. The API-GUI paradigm is conceptually straightforward—it effectively automates API construction via LLMs rather than introducing a new interaction or reasoning mechanism. Similarly, the Entropulse training alternation between RL and SFT is more of a practical training schedule than a novel learning algorithm. The training curves in the figures appear to correspond to single runs, with no error
Taxonomy
Topics: Reinforcement Learning in Robotics
