Cradle: Empowering Foundation Agents Towards General Computer Control
Weihao Tan, Wentao Zhang, Xinrun Xu, Haochong Xia, Ziluo Ding, Boyu Li, Bohan Zhou, Junpeng Yue, Jiechuan Jiang, Yewen Li, Ruyi An, Molei Qin, Chuqiao Zong, Longtao Zheng, Yujie Wu, Xiaoqiang Chai, Yifei Bi, Tianbao Xie, Pengjie Gu, Xiyun Li, Ceyao Zhang, Long Tian

TL;DR
Cradle is a modular framework that enables foundation agents to interact with diverse software through standardized interfaces, demonstrating high generalization and success in complex tasks across games and applications.
Contribution
Introduces Cradle, a flexible LMM-powered framework that allows foundation agents to control any software via screenshots and keyboard/mouse actions, advancing towards general computer control.
Findings
Successfully completed complex tasks in four commercial video games.
Operated daily software like Chrome and Outlook effectively.
Completed long-horizon missions in AAA games such as Red Dead Redemption 2 (RDR2).
Abstract
Despite the success in specific scenarios, existing foundation agents still struggle to generalize across various virtual scenarios, mainly due to the dramatically different encapsulations of environments with manually designed observation and action spaces. To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output. We introduce Cradle, a modular and flexible LMM-powered framework, as a preliminary attempt towards GCC. Enhanced by six key modules, Cradle can understand input screenshots and output executable code for low-level keyboard and mouse control after high-level planning, so that Cradle can interact with any software and complete long-horizon complex tasks without relying on any…
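The GCC setting described in the abstract (screenshots in, keyboard and mouse actions out) can be pictured as a small observe-act loop. The following is a minimal sketch under stated assumptions, not Cradle's actual code: every name here (`KeyPress`, `MouseClick`, `Computer`, `FakeComputer`, `run_episode`, `scripted_policy`) is hypothetical, the in-memory `FakeComputer` stands in for a real screen, and a scripted function stands in for the LMM-based planning that the paper's modules perform.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Union


@dataclass(frozen=True)
class KeyPress:
    """Hypothetical low-level keyboard action."""
    key: str
    duration_s: float = 0.1


@dataclass(frozen=True)
class MouseClick:
    """Hypothetical low-level mouse action at screen coordinates."""
    x: int
    y: int
    button: str = "left"


Action = Union[KeyPress, MouseClick]


class Computer(Protocol):
    """The only interface a GCC-style agent is assumed to see."""

    def screenshot(self) -> bytes: ...
    def execute(self, action: Action) -> None: ...


class FakeComputer:
    """In-memory stand-in so the sketch runs without a real display."""

    def __init__(self) -> None:
        self.frame = 0
        self.log: List[Action] = []

    def screenshot(self) -> bytes:
        self.frame += 1
        return f"frame-{self.frame}".encode()

    def execute(self, action: Action) -> None:
        self.log.append(action)


# A policy maps (screenshot, task description) to a list of actions.
# In a real agent this would call an LMM; here it is an assumption.
Policy = Callable[[bytes, str], List[Action]]


def run_episode(computer: Computer, policy: Policy, task: str,
                max_steps: int = 3) -> int:
    """Observe, plan, act; stop when the policy returns no actions."""
    steps = 0
    for _ in range(max_steps):
        image = computer.screenshot()
        actions = policy(image, task)
        if not actions:
            break
        for action in actions:
            computer.execute(action)
        steps += 1
    return steps


def scripted_policy(image: bytes, task: str) -> List[Action]:
    """Placeholder for the LMM: reacts only to the first two frames."""
    if image == b"frame-1":
        return [MouseClick(100, 200)]
    if image == b"frame-2":
        return [KeyPress("enter")]
    return []
```

Because the agent touches only `screenshot()` and `execute()`, the same loop could in principle be pointed at a game or a desktop application by swapping the `Computer` implementation, which is the generality argument the abstract makes.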
Taxonomy
Topics: Multi-Agent Systems and Negotiation
