MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents

Ziyun Zeng; Hang Hua; Bocheng Zou; Mu Cai; Rogerio Feris; Jiebo Luo

arXiv:2605.18652 · cs.CV · May 19, 2026

TL;DR

MementoGUI introduces a modular, learned memory control framework for long-horizon GUI agents, enhancing their ability to maintain and utilize task-relevant visual and textual history for improved decision-making.

Contribution

The paper presents MementoGUI, a plug-in agentic memory system with learned controllers for online memory management, enabling better long-term GUI task performance without finetuning the backbone.

Findings

01. MementoGUI improves GUI agent performance over baseline methods.

02. Larger MementoCore backbones lead to stronger memory-augmented control.

03. The framework effectively handles long-horizon decision-making in GUI tasks.

Abstract

Recent GUI agents have made substantial progress in visual grounding and action prediction, yet they remain brittle in long-horizon tasks that require maintaining task state across many interface transitions. Existing agents typically rely on raw history replay or text-only memory, which either overwhelms the model with redundant screenshots or discards localized visual evidence needed for future decisions. To address these limitations, we introduce MementoGUI, a plug-in agentic memory framework that equips MLLM-based GUI agents with MementoCore, a learned controller for online memory selection, compression, and retrieval. Rather than treating interaction history as a fixed context, MementoGUI formulates long-horizon GUI control as an online memory-control problem: working memory selectively preserves task-relevant interface events with textual summaries and ROI-level…
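
The three controller operations named in the abstract (selection, compression, retrieval) can be illustrated with a toy working memory. This is a minimal sketch under stated assumptions, not the paper's implementation: the class and function names (`MemoryEntry`, `WorkingMemory`, `word_overlap`), the capacity limit, the merge-the-two-oldest compression rule, and the `roi` box field are all hypothetical, and a simple word-overlap heuristic stands in for the learned MementoCore controller.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical names throughout; none of this is the paper's API.
Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) screen region


@dataclass
class MemoryEntry:
    step: int                  # interaction step the event came from
    summary: str               # textual summary of the interface event
    roi: Optional[Box] = None  # assumed placeholder for a visual region


def word_overlap(a: str, b: str) -> float:
    """Heuristic stand-in for a learned scorer: count of shared words."""
    return float(len(set(a.lower().split()) & set(b.lower().split())))


@dataclass
class WorkingMemory:
    capacity: int
    is_relevant: Callable[[str, str], bool]  # stand-in for learned selection
    score: Callable[[str, str], float]       # stand-in for learned retrieval
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, task: str, step: int, summary: str,
              roi: Optional[Box] = None) -> bool:
        """Selection: store the event only if it is judged task-relevant."""
        if not self.is_relevant(task, summary):
            return False
        self.entries.append(MemoryEntry(step, summary, roi))
        if len(self.entries) > self.capacity:
            self._compress()
        return True

    def _compress(self) -> None:
        """Compression (assumed rule): merge the two oldest entries into one
        text-only entry, dropping their regions."""
        a, b = self.entries[0], self.entries[1]
        self.entries[:2] = [MemoryEntry(b.step, a.summary + "; " + b.summary)]

    def retrieve(self, query: str, k: int = 2) -> List[MemoryEntry]:
        """Retrieval: return the k entries scoring highest against the query."""
        ranked = sorted(self.entries,
                        key=lambda e: self.score(query, e.summary),
                        reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    task = "book flight to paris"
    mem = WorkingMemory(capacity=2,
                        is_relevant=lambda t, s: word_overlap(t, s) > 0,
                        score=word_overlap)
    mem.write(task, 0, "opened flight search page", roi=(0, 0, 100, 40))
    mem.write(task, 1, "dismissed cookie banner")  # filtered out as irrelevant
    mem.write(task, 2, "typed paris in destination field")
    mem.write(task, 3, "selected flight at 9am")   # exceeds capacity, compresses
    for entry in mem.retrieve("paris destination", k=1):
        print(entry.step, entry.summary)
```

Because the memory sits outside the agent and only decides what context to hand it, a wrapper of this shape leaves the backbone model untouched, which matches the plug-in framing above. In the actual system the selection and scoring callables would be learned rather than hand-written.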
