Agent S: An Open Agentic Framework that Uses Computers Like a Human
Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang

TL;DR
Agent S is an open framework that automates complex human-computer interactions using hierarchical planning and multimodal language models, raising the success rate on the OSWorld benchmark by 9.37% over the baseline.
Contribution
It introduces experience-augmented hierarchical planning and an Agent-Computer Interface for GUI automation, advancing the state of the art in autonomous human-computer interaction.
Findings
Outperforms the baseline by 9.37% in success rate on the OSWorld benchmark
This gain is an 83.6% relative improvement over the baseline, not an 83.6% success rate
Demonstrates broad generalizability to different OS environments
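The two headline numbers can be cross-checked with simple arithmetic. The sketch below assumes that 9.37% is an absolute gain in success rate (percentage points) and that 83.6% is the relative improvement over the baseline; the function name is hypothetical and the result is only as precise as those two rounded figures.

```python
def rates_from_gains(absolute_gain_pts: float, relative_gain_pct: float):
    """Recover (baseline, improved) success rates, both in percent.

    Assumes:
        improved - baseline              = absolute_gain_pts
        (improved - baseline) / baseline = relative_gain_pct / 100
    """
    baseline = absolute_gain_pts / (relative_gain_pct / 100.0)
    improved = baseline + absolute_gain_pts
    return baseline, improved


baseline, improved = rates_from_gains(9.37, 83.6)
print(f"baseline ~ {baseline:.2f}%, Agent S ~ {improved:.2f}%")
```

Under these assumptions the baseline comes out at roughly 11.2% and Agent S at roughly 20.6%, which shows how far the absolute success rate is from 83.6%.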
Abstract
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S…
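As a rough illustration of the experience-augmented hierarchical planning described above, the following toy sketch separates a task-level planner from a subtask-level executor, each backed by its own memory. Everything here is an assumption made for illustration: the `Memory` class, the word-overlap retrieval, the `plan` and `run` functions, and the stubbed search and execution callables are hypothetical and are not the authors' implementation, which relies on an MLLM rather than string splitting.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


def overlap(a: str, b: str) -> int:
    """Crude similarity: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


@dataclass
class Memory:
    """Maps a key (task or subtask text) to a stored experience."""
    entries: Dict[str, str] = field(default_factory=dict)

    def retrieve(self, query: str) -> Optional[str]:
        if not self.entries:
            return None
        best = max(self.entries, key=lambda k: overlap(k, query))
        return self.entries[best] if overlap(best, query) > 0 else None

    def store(self, key: str, value: str) -> None:
        self.entries[key] = value


def plan(task: str, web_search: Callable[[str], str], task_memory: Memory) -> List[str]:
    """Task level: prefer stored experience, else fall back to external search.

    A real agent would prompt an MLLM here; this stub splits guidance on ';'.
    """
    guidance = task_memory.retrieve(task) or web_search(task)
    return [s.strip() for s in guidance.split(";") if s.strip()]


def run(task, web_search, task_memory, subtask_memory, execute) -> List[str]:
    """Plan, execute each subtask with a retrieved hint, then save the experience."""
    subtasks = plan(task, web_search, task_memory)
    log = []
    for sub in subtasks:
        hint = subtask_memory.retrieve(sub)   # subtask-level experience
        result = execute(sub, hint)
        subtask_memory.store(sub, result)
        log.append(result)
    task_memory.store(task, "; ".join(subtasks))  # task-level experience
    return log


# Hypothetical stand-ins for web search and GUI execution.
fake_search = lambda task: "open settings; find display panel; set dark theme"
fake_execute = lambda sub, hint: f"done: {sub}" + (" (reused hint)" if hint else "")

task_memory, subtask_memory = Memory(), Memory()
first = run("enable dark theme", fake_search, task_memory, subtask_memory, fake_execute)
second = run("enable dark theme", fake_search, task_memory, subtask_memory, fake_execute)
```

On the first run the plan comes from the stubbed search and no hints exist; on the second run both the plan and the per-subtask hints come from memory, which is the closed loop the abstract alludes to.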
Peer Reviews
Decision: ICLR 2025 Poster
1. Agent S stands out for its task automation through experience-augmented hierarchical planning. This method harnesses external web knowledge and draws upon internal memories, enabling the agent to decompose complex tasks into executable subtasks.
2. The introduction of the Agent-Computer Interface (ACI) is a notable strength of Agent S. This interface serves as a critical abstraction layer that facilitates precise perception and action in GUI environments. By defining a bounded action space wi

1. The paper does not address the scalability and efficiency of the framework when handling a large volume of tasks or more complex workflows. There is a need to evaluate how the agent performs under increased load and whether the hierarchical planning and memory update mechanisms can scale without compromising the speed and accuracy of task completion.
2. The framework's performance could potentially falter in scenarios where reliable web knowledge is scarce or when there are frequent, rapid ch

1. The performance of the proposed framework on OSWorld benchmark is quite good.
2. The proposed framework is well-engineered and the evaluation is systematic.
3. The presentation and visualization of the paper is good.

1. It would be unfair to compare the framework only to the baseline from OSWorld, which is a benchmark paper, not a methodology paper.
2. The self-supervised exploration process is not realistic in actual deployments and I believe it will lead to overfitting.
3. The ACI proposed in this paper can only act on selected elements in the accessibility tree, which somewhat sacrifices flexibility for performance because you cannot click on every coordinate of the screen.
4. While the framework is well

1. **Novel and Effective Memory Mechanism**:
   - Introduces a well-designed memory system with both narrative and episodic components
   - Provides clear algorithms for both initial memory construction and continuous updates
   - Demonstrates a complete closed-loop system with practical effectiveness
2. **Insightful Analysis of Agent-Computer Interaction**:
   - Deep analysis of fundamental challenges in MLLM-based computer control
   - Identifies key issues like discrete time response, lack of internal coordinate sy

1. **Limited Problem Definition**:
   - The paper could benefit from a more detailed introduction to computer automation tasks
   - Key concepts like planning, execution, and grounding could be better explained for readers new to the field
2. **Presentation Issues**:
   - Some overlap between Figures 3 and 4 that could be consolidated or better differentiated
   - Technical details of the ACI implementation could be more thoroughly described
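Several reviewers discuss the ACI's bounded action space and its restriction to elements from the accessibility tree. The sketch below is one hypothetical way such a constraint could look; the `Element` and `BoundedInterface` names, the fields, and the two allowed actions are assumptions for illustration and do not describe the paper's actual interface. Real mouse and keyboard calls are replaced by an event log.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Element:
    """One node taken from an accessibility tree (fields are illustrative)."""
    element_id: int
    role: str
    name: str
    x: int
    y: int


class BoundedInterface:
    """Exposes only a small set of actions, each targeting a known element ID."""

    ALLOWED = {"click", "type"}

    def __init__(self, elements: List[Element]):
        self.elements: Dict[int, Element] = {e.element_id: e for e in elements}
        self.events: List[str] = []  # stands in for real input events

    def act(self, action: str, element_id: int, text: str = "") -> str:
        if action not in self.ALLOWED:
            raise ValueError(f"unsupported action: {action}")
        if element_id not in self.elements:
            raise KeyError(f"no element with id {element_id}")
        el = self.elements[element_id]
        # Coordinates come from the element, never from the model's output.
        event = f"{action} '{el.name}' at ({el.x}, {el.y})"
        if action == "type":
            event += f" text={text!r}"
        self.events.append(event)
        return event


ui = BoundedInterface([
    Element(0, "button", "Save", 40, 12),
    Element(1, "entry", "Filename", 120, 80),
])
ui.act("type", 1, "report.txt")
ui.act("click", 0)
```

The trade-off raised in the reviews is visible here: the agent never has to guess pixel coordinates, but it also cannot reach anything that is missing from the element table.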
Taxonomy
Topics: Multi-Agent Systems and Negotiation
