Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
Shiquan Zhang, Tianyi Zhang, Le Fang, Simon D'Alfonso, Hong Jia, Vassilis Kostakos

TL;DR
This study introduces DailyDroid, a benchmark of 75 tasks across 25 Android apps for evaluating LLM-driven smartphone automation. It identifies key failure points and finds that multimodal (text + screenshot) inputs yield only marginally higher success rates than text-only inputs.
Contribution
The paper presents DailyDroid, a new benchmark with extensive failure analysis, highlighting challenges in UI accessibility and input modalities for LLM-based mobile agents.
Findings
Multimodal inputs slightly improve success rates over text-only inputs.
Common failures include UI accessibility issues and misinterpretation of instructions.
The findings suggest that improvements are needed in UI design and input modalities for better LLM performance.
Abstract
With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI…
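The trial count quoted in the abstract is consistent with running each of the 75 tasks once per model and per input modality (75 × 2 × 2 = 300). The abstract does not spell out this breakdown, so the following is only a minimal sketch under that assumption. The names `build_trial_grid`, `task_id`, and the modality labels are hypothetical and are not taken from the paper.

```python
from itertools import product

# Figures taken from the abstract.
NUM_TASKS = 75
MODELS = ("GPT-4o", "o4-mini")
# Labels are illustrative; the abstract describes text-only and
# multimodal (text + screenshot) inputs.
MODALITIES = ("text-only", "text+screenshot")


def build_trial_grid(num_tasks=NUM_TASKS, models=MODELS, modalities=MODALITIES):
    """Enumerate one trial per (task, model, modality) combination.

    Assumption: each task is attempted exactly once per configuration.
    """
    return [
        {"task_id": task_id, "model": model, "modality": modality}
        for task_id, model, modality in product(
            range(1, num_tasks + 1), models, modalities
        )
    ]


trials = build_trial_grid()
print(len(trials))  # 75 tasks * 2 models * 2 modalities
```

Under this reading, each model and modality pair is evaluated on the same 75 tasks, which is what makes the text-only and multimodal success rates directly comparable.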
