Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots

Shiquan Zhang; Tianyi Zhang; Le Fang; Simon D'Alfonso; Hong Jia; Vassilis Kostakos

arXiv:2604.17817·cs.HC·April 21, 2026


TL;DR

This study introduces DailyDroid, a benchmark of 75 tasks across 25 Android apps for evaluating LLM-driven smartphone automation. It identifies key failure points and finds that multimodal (text + screenshot) inputs yield only marginally higher success rates than text-only inputs.

Contribution

The paper presents DailyDroid, a new benchmark with extensive failure analysis, highlighting challenges in UI accessibility and input modalities for LLM-based mobile agents.

Findings

01. Multimodal inputs slightly improve success rates over text-only inputs.

02. Common failures include UI accessibility issues and misinterpretation of instructions.

03. The findings suggest that improvements are needed in UI design and input modalities for better LLM performance.

Abstract

With the rapid advancement of large language models (LLMs), mobile agents have emerged as promising tools for phone automation, simulating human interactions on screens to accomplish complex tasks. However, these agents often suffer from low accuracy, misinterpretation of user instructions, and failure on challenging tasks, with limited prior work examining why and where they fail. To address this, we introduce DailyDroid, a benchmark of 75 tasks in five scenarios across 25 Android apps, spanning three difficulty levels to mimic everyday smartphone use. We evaluate it using text-only and multimodal (text + screenshot) inputs on GPT-4o and o4-mini across 300 trials, revealing comparable performance with multimodal inputs yielding marginally higher success rates. Through in-depth failure analysis, we compile a handbook of common failures. Our findings reveal critical issues in UI…
