FingerTip 20K: A Benchmark for Proactive and Personalized Mobile LLM Agents

Qinglong Yang; Haoming Li; Haotian Zhao; Xiaokai Yan; Jingtao Ding; Fengli Xu; Yong Li

arXiv:2507.21071·cs.HC·March 17, 2026

TL;DR

This paper introduces FingerTip 20K, a comprehensive benchmark dataset for developing proactive and personalized mobile GUI agents using multimodal large language models, emphasizing real-world user context and preferences.

Contribution

The paper presents a new benchmark dataset with 20K human demonstrations capturing user context for proactive and personalized mobile agent tasks, addressing gaps in existing GUI agent research.

Findings

01 · Models struggle with leveraging user context effectively.

02 · Fine-tuned models outperform baseline agents in personalized tasks.

03 · Significant gap remains between current agents and human performance.

Abstract

Mobile GUI agents are becoming critical tools to improve user experience on smart devices, with multimodal large language models (MLLMs) emerging as the dominant paradigms in this domain. Current agents, however, rely on explicit human instructions, overlooking the potential to leverage the contextual information (like location, time, user profile) and historical data for proactive task suggestions. Besides, previous works focus on optimizing the success rate during task execution, but pay less attention to the personalized execution trajectory, thereby neglecting potentially vast differences in user preferences. To address these challenges, we introduce the FingerTip 20K benchmark. We collected 20K unique human demonstrations of multi-step Android device interactions across a variety of everyday apps. These demonstrations are not isolated but are continuously acquired from the users'…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

- Truly real data. Unlike existing benchmarks that rely on emulators or auto-exploration, this is real phones + real users + real daily intents, so the distribution shift is authentic.
- New task formulations. Proactive recommendation and “execute like this user” are exactly what current mobile agents lack; existing benchmarks mostly test “can you follow a given instruction.”
- Rich context signals. Time, location category, user profile, and multi-intent history make it suitable for modeling pre

Weaknesses

- Privacy / deployment gap. Real-world agents won’t always have such clean, explicit user-intent annotations; some discussion of weaker supervision would help.

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* The new benchmark targets the two problems of proactive task suggestion and personalized task execution, which are important to agent applications.
* The proposed benchmark is well designed, with sufficient diversity covering many real-world applications.
* The paper reports the results of a set of algorithms that can be used as baselines for future research. Also, the proposed evaluation metric seems to correspond reasonably to human behavior.

Weaknesses

* The presentation of the paper should be improved. For example, the figures and charts in the paper could be replaced with vector images, which would provide better visual quality.
* For the baseline evaluations in Table 3 and Table 4, it would be better to include more VLMs for a more complete evaluation. For example, how does performance change with more parameters, such as 72B vs. 7B? Also, how does a thinking model perform compared with a non-thinking model?
* The user preference may be sub

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. Real-World Data: Collected from long-term, in-the-wild mobile users, the dataset captures authentic intents and behaviors, offering far higher ecological validity than synthetic or simulator-based settings.
2. Comprehensive Context Annotation: Each episode includes detailed user profiles, interaction metadata, and historical action, enabling advanced personalization and preference modeling.
3. Rigorous Task & Metric Design: Both tracks are clearly formalized with well-defined metrics and base

Weaknesses

1. Limited Demographic Scope: All data originate from Chinese Android users, which restricts global generalizability, and the benchmark may include implicit bias.
2. Missing Related Work: The paper does not compare with several key studies on GUI benchmarks (e.g., AndroidWorld) and proactive LLM-based agents or personalization systems (e.g., AutoDroid, AppAgent).
3. Superficial Ethics Discussion: Privacy and anonymization issues are acknowledged but insufficiently analyzed.
