TL;DR
MGA introduces a memory-driven framework for GUI agents that decouples long-horizon tasks into independent steps, reducing complexity and improving efficiency in GUI automation.
Contribution
The paper proposes a minimalist, memory-based approach that improves GUI agent performance by decoupling decision steps and eliminating redundant modules.
Findings
MGA achieves competitive performance on OSWorld and real-world GUI tasks.
The structured memory mechanism reduces system redundancy and cognitive overhead.
MGA maintains high efficiency with a simplified architecture.
Abstract
Multimodal Large Language Models (MLLMs) have significantly advanced GUI agents, yet long-horizon automation remains constrained by two critical bottlenecks: context overload from raw sequential trajectory dependence and architectural redundancy from over-engineered expert modules. Prevailing End-to-End and Multi-Agent paradigms struggle with error cascades caused by concatenated visual-textual histories and incur high inference latency due to redundant expert components, limiting their practical deployment. To address these issues, we propose the Memory-Driven GUI Agent (MGA), a minimalist framework that decouples long-horizon trajectories into independent decision steps linked by a structured state memory. MGA operates on an "Observe First and Memory Enhancement" principle, powered by two tightly coupled core mechanisms: (1) an Observer module that acts as a task-agnostic,…
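As a rough illustration of what "independent decision steps linked by a structured state memory" could look like, here is a minimal Python sketch. It is not the authors' implementation: the `StateMemory` fields, the `run_episode` loop, the `"DONE"` sentinel, and the toy `observe`/`decide` callables are all hypothetical names introduced for this example. The only idea taken from the abstract is that each step sees a fresh observation plus a compact memory, rather than the full concatenated history.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class StateMemory:
    """Compact summary carried between steps (fields are illustrative assumptions)."""
    task: str
    completed: Tuple[str, ...] = ()
    notes: Tuple[str, ...] = ()


def run_episode(
    task: str,
    observe: Callable[[], str],
    decide: Callable[[str, StateMemory], Tuple[str, str]],
    max_steps: int = 10,
) -> Tuple[List[str], StateMemory]:
    """Run decoupled steps: each decision uses only (observation, memory)."""
    memory = StateMemory(task=task)
    actions: List[str] = []
    for _ in range(max_steps):
        observation = observe()  # look at the current screen first
        action, note = decide(observation, memory)  # no raw trajectory is passed in
        if action == "DONE":  # hypothetical stop signal
            break
        actions.append(action)
        # Update the structured memory instead of appending raw history.
        memory = replace(
            memory,
            completed=memory.completed + (action,),
            notes=memory.notes + (note,),
        )
    return actions, memory


def demo() -> Tuple[List[str], StateMemory]:
    """Toy environment: three screens, stop when the settings page appears."""
    screens = iter(["login page", "dashboard", "settings page"])

    def observe() -> str:
        return next(screens)

    def decide(observation: str, memory: StateMemory) -> Tuple[str, str]:
        if observation == "settings page":
            return "DONE", "reached target"
        return f"click_next_from:{observation}", f"saw {observation}"

    return run_episode("open settings", observe, decide)
```

In this sketch the per-step input stays roughly constant in size regardless of episode length, which is the property the abstract contrasts with concatenated visual-textual histories. A real agent would replace `observe` with a screenshot-based observer and `decide` with a model call.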
