TL;DR
Gym-Anything is a framework that converts any software into an interactive environment for training and evaluating computer-use agents, enabling scalable, long-horizon tasks across diverse domains.
Contribution
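The paper introduces a pipeline that automatically builds interactive environments from software: a coding agent sets up the software, and an independent audit agent verifies the setup. The pipeline is used to produce a large, diverse benchmark, and auditing feedback applied at test time improves agent performance.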
Findings
Created CUA-World with 10K+ tasks across multiple domains.
Developed a vision-language model that outperforms larger models on the benchmark.
Enhanced agent performance by applying auditing feedback at test time.
Abstract
Computer-use agents hold the promise of assisting in a wide range of digital economic activities. However, current research has largely focused on short-horizon tasks over a limited set of software with limited economic value, such as basic e-commerce and OS-configuration tasks. A key reason is that creating environments for complex software requires significant time and human effort, and therefore does not scale. To address this, we introduce Gym-Anything, a framework for converting any software into an interactive computer-use environment. We frame environment creation itself as a multi-agent task: a coding agent writes setup scripts, downloads real-world data, and configures the software, while producing evidence of correct setup. An independent audit agent then verifies evidence for the environment setup against a quality checklist. Using a taxonomy of economically valuable…
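The create-then-audit split described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: `SetupResult`, `audit`, `build_environment`, the checklist item names, and the toy creator are all hypothetical, and the retry loop that feeds failed checklist items back to the coding agent is an assumption, since the abstract only states that an independent audit agent verifies the setup evidence against a quality checklist.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SetupResult:
    """Output of a (hypothetical) coding agent: a setup script plus evidence."""
    script: str
    # Maps a checklist item to whether the agent produced evidence for it.
    evidence: Dict[str, bool] = field(default_factory=dict)


def audit(result: SetupResult, checklist: List[str]) -> List[str]:
    """Return the checklist items that lack supporting evidence."""
    return [item for item in checklist if not result.evidence.get(item, False)]


def build_environment(
    create: Callable[[List[str]], SetupResult],
    checklist: List[str],
    max_rounds: int = 3,
) -> SetupResult:
    """Alternate creation and auditing until the checklist passes.

    Assumption: failed items are passed back to the creator as feedback.
    """
    feedback: List[str] = []
    for _ in range(max_rounds):
        result = create(feedback)
        feedback = audit(result, checklist)
        if not feedback:
            return result
    raise RuntimeError(f"unresolved checklist items: {feedback}")


def toy_creator(feedback: List[str]) -> SetupResult:
    """Stand-in for a coding agent; fixes a missing item only when told to."""
    evidence = {"software_launches": True}
    if "real_data_loaded" in feedback:
        evidence["real_data_loaded"] = True
    return SetupResult(script="# setup steps", evidence=evidence)


# Hypothetical quality checklist with two items.
CHECKLIST = ["software_launches", "real_data_loaded"]
env = build_environment(toy_creator, CHECKLIST)
```

In this toy run the first round fails the audit on `real_data_loaded`, and the second round passes once the creator receives that item as feedback. Keeping `audit` separate from `create` mirrors the independence of the two agents: the auditor only sees the evidence, not how it was produced.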
