MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo

TL;DR
MementoGUI introduces a modular, learned memory control framework for long-horizon GUI agents, enhancing their ability to maintain and utilize task-relevant visual and textual history for improved decision-making.
Contribution
The paper presents MementoGUI, a plug-in agentic memory system with learned controllers for online memory management, enabling better long-term GUI task performance without finetuning the backbone.
Findings
MementoGUI improves GUI agent performance over baseline methods.
Larger MementoCore backbones lead to stronger memory-augmented control.
The framework effectively handles long-horizon decision-making in GUI tasks.
Abstract
Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level…
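To make the select / compress / retrieve loop named in the abstract concrete, here is a minimal Python sketch. It is not the paper's implementation: MementoCore is a learned controller, whereas this sketch swaps in a crude word-overlap heuristic purely for illustration. Every name here (`MemoryEntry`, `WorkingMemory`, `overlap`, the `roi` bounding-box format, the `capacity` parameter) is a hypothetical stand-in.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class MemoryEntry:
    step: int
    summary: str  # textual summary of one interface event
    roi: Optional[Tuple[int, int, int, int]] = None  # assumed (x0, y0, x1, y1) crop


def overlap(a: str, b: str) -> int:
    """Count shared lowercase words; a stand-in for a learned relevance score."""
    return len(set(a.lower().split()) & set(b.lower().split()))


class WorkingMemory:
    def __init__(self, task: str, capacity: int = 3):
        self.task = task
        self.capacity = capacity
        self.entries: List[MemoryEntry] = []

    def select(self, entry: MemoryEntry) -> bool:
        """Selection: keep an event only if it shares a word with the task."""
        if overlap(entry.summary, self.task) == 0:
            return False
        self.entries.append(entry)
        self.compress()
        return True

    def compress(self) -> None:
        """Compression: when over capacity, drop the least task-relevant
        entry, breaking ties in favour of dropping the oldest step."""
        while len(self.entries) > self.capacity:
            worst = min(
                self.entries,
                key=lambda e: (overlap(e.summary, self.task), e.step),
            )
            self.entries.remove(worst)

    def retrieve(self, query: str, k: int = 2) -> List[MemoryEntry]:
        """Retrieval: return up to k entries ranked by word overlap with the query."""
        ranked = sorted(
            self.entries,
            key=lambda e: overlap(e.summary, query),
            reverse=True,
        )
        return ranked[:k]


mem = WorkingMemory("book a flight to Paris", capacity=2)
mem.select(MemoryEntry(1, "opened flight search page"))
mem.select(MemoryEntry(2, "dismissed cookie banner"))  # rejected: no task overlap
mem.select(MemoryEntry(3, "typed Paris as flight destination", roi=(10, 20, 200, 60)))
mem.select(MemoryEntry(4, "clicked book button"))  # triggers compression
best = mem.retrieve("which destination was typed", k=1)[0]
```

In this toy run the cookie-banner event is never stored, the capacity limit evicts the step-1 entry once step 4 arrives, and the retrieval query brings back the step-3 entry together with its ROI crop. A learned controller would replace `overlap` in all three places.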
