MemGUI-Bench: Benchmarking Memory of Mobile GUI Agents in Dynamic Environments
Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Qinyi Luo, Shunye Tang, Yuxiang Chai, Weifeng Lin, Han Xiao, WenHao Wang, Siheng Chen, Zhengxi Lu, Gao Wu, Hao Wang, Liang Liu, Yong Liu

TL;DR
MemGUI-Bench is a new comprehensive benchmark designed to evaluate and analyze the memory capabilities of mobile GUI agents in dynamic environments, addressing gaps in existing assessments.
Contribution
It introduces a systematic memory taxonomy, a large set of challenging tasks, an automated evaluation pipeline, and a thorough assessment of state-of-the-art agents.
Findings
All evaluated agents show significant memory deficits.
Five distinct failure modes identified.
Five actionable design implications proposed.
Abstract
Current mobile GUI agent benchmarks systematically fail to assess memory capabilities, with only 5.2-11.8% memory-related tasks and no cross-session learning evaluation. We introduce MemGUI-Bench, a comprehensive memory-centric benchmark with pass@k and staged LLM-as-judge evaluation. Our contributions include: (1) a systematic memory taxonomy analyzing 11 agents across 5 architectures; (2) 128 tasks across 26 applications where 89.8% challenge memory through cross-temporal and cross-spatial retention; (3) MemGUI-Eval, an automated pipeline with Progressive Scrutiny and 7 hierarchical metrics; and (4) RQ-driven assessment of 11 state-of-the-art agents. Our experiments reveal significant memory deficits across all evaluated systems, identify 5 distinct failure modes, and synthesize 5 actionable design implications. All resources including code, benchmark, and evaluation results will be…
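The abstract names pass@k as a metric but gives no formula. As a minimal sketch, assuming the widely used unbiased estimator (with n attempts per task of which c succeed, pass@k = 1 - C(n-c, k) / C(n, k)), the per-task and benchmark-level scores could be computed as below. The function names and the sample data are hypothetical, and MemGUI-Bench may define pass@k differently, for example as success within k sequential attempts.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: attempts recorded for the task
    c: attempts that succeeded
    k: attempt budget being scored
    Assumption: this is the common estimator, not necessarily the paper's.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so every k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over tasks; each entry is (attempts, successes)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical data: three tasks, three attempts each.
sample = [(3, 0), (3, 1), (3, 3)]
score_at_1 = benchmark_pass_at_k(sample, k=1)
score_at_3 = benchmark_pass_at_k(sample, k=3)
```

With this estimator, a task that succeeds on only one of three attempts contributes 1/3 at k=1 but a full 1.0 at k=3, which is why a gap between pass@1 and pass@k is often read as a sign that an agent benefits from retries.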
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper identifies a clear gap in existing mobile-agent benchmarks, and MemGUI-Bench bridges it with a clear focus.
- The modular design of the MemGUI-Bench evaluation makes it easy to integrate with existing benchmarks.
- The surprisingly low performance of existing state-of-the-art methods on the benchmark substantiates its claims about the memory gap.
- Paper formatting issues in Table 1 and Figure 4.
- Lack of qualitative examples showcasing the memory gap in state-of-the-art models such as UI-TARS. Adding such examples would demonstrate the need for the benchmark more clearly.
- L27: "First comprehensive benchmark for GUI-agent memory" is plausible, but Table 4 shows that prior memory tasks exist (e.g., SPA-Bench has 40/340). I would suggest qualifying this to "first comprehensive, memory-centric benchmark with pass@k and a staged LLM-as-judge evaluator."
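The task shares quoted in the abstract and in this review can be sanity-checked with simple arithmetic. The sketch below uses only figures that appear on this page (40 of 340 tasks for SPA-Bench; 128 tasks with an 89.8% memory share for MemGUI-Bench). The helper name is hypothetical, and the implied task count of 115 is derived here, not stated by the authors.

```python
def share_percent(part: int, total: int) -> float:
    """Return part/total as a percentage."""
    return 100.0 * part / total


# SPA-Bench memory-task share cited by the reviewer: 40 of 340 tasks.
spa_bench_share = round(share_percent(40, 340), 1)

# Task count implied by an 89.8% memory share of 128 tasks (derived, not quoted).
implied_memory_tasks = round(0.898 * 128)

# Converting the implied count back to a percentage reproduces the quoted share.
memgui_share = round(share_percent(implied_memory_tasks, 128), 1)

print(spa_bench_share, implied_memory_tasks, memgui_share)
```

The SPA-Bench figure works out to about 11.8%, consistent with the upper end of the 5.2-11.8% range in the abstract.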
- Timely problem framing. Mobile GUI agents are rising; a purpose-built memory benchmark is valuable and under-served. The cross-temporal/cross-spatial emphasis aligns with real usage.
- Scale & coverage. 128 tasks / 26 apps is non-trivial for interactive GUI evaluation; the claimed memory-task share (~89.8%) suggests deliberate design rather than incidental memory.
- Evaluation pipeline ambition. “Progressive Scrutiny” + hierarchical metrics aim to move beyond pass/fail, which is the right di
- **Generalization beyond the curated suite**: The benchmark spans 26 apps, but are they category-balanced (commerce, productivity, social, finance), regionally representative, and covering UI paradigms (infinite scroll, nested modals, webviews)? Without a sampling rationale and held-out app categories, it’s unclear if results generalize or if models “learn the test.”
- **Memory vs. perception/exploration confound**: It’s unclear whether measured failures are truly memory failures versus UI per
1. **Large-Scale Effort.**
> This paper analyzes 11 agents and aggregates tens of applications, covering major works in the domain of mobile manipulation.
2. **Comprehensive agent support and evaluation protocol.**
> As introduced in Section 3.2, this work supports a unified pipeline to ensure robust agent evaluation. Also, the metrics proposed in Section 4 enable memory-targeting evaluations, with human-annotated references.
1. **Limited practical utility for agent development.**
> First, memory seems to be a useful component designed into *some* works, yet it is not a universal feature that every agent needs to incorporate. Therefore, evaluating memory is an internal, intermediate self-check for some agents, rather than a universal correctness metric such as task success rate.
> Second, this work structures agent memory in a way possibly inspired by how human memory works (this point is less justified as well), yet this may no
Taxonomy
Topics: Advanced Software Engineering Methodologies · Mobile Agent-Based Network Management · Artificial Intelligence in Games
