OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability
Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim

TL;DR
OpenApps introduces a configurable ecosystem of apps to evaluate UI-Agent reliability across diverse app variations, revealing significant fluctuations in success rates and behaviors that are overlooked by fixed-environment assessments.
Contribution
The paper presents OpenApps, a lightweight, open-source platform enabling large-scale evaluation of UI-Agents across varied app configurations, addressing a critical blind spot in current reliability assessments.
Findings
Reliability varies drastically across app variations.
Task success rates can fluctuate by more than 50%.
Agent behaviors like looping differ with environment configurations.
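To make the fluctuation finding concrete, here is a minimal sketch of how a spread in success rate across app variations could be computed. It is not the paper's evaluation code. The function names, the variation names, and the episode outcomes are all hypothetical. It also assumes that "fluctuate" means the gap between the best and worst per-variation success rate, which the summary does not specify.

```python
from collections import defaultdict


def success_rate_by_variation(episodes):
    """Return {variation: success rate} from (variation, succeeded) pairs."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for variation, succeeded in episodes:
        totals[variation] += 1
        wins[variation] += int(succeeded)
    return {v: wins[v] / totals[v] for v in totals}


def fluctuation(rates):
    """Gap between the best and worst per-variation success rate (assumed metric)."""
    return max(rates.values()) - min(rates.values())


# Made-up outcomes for one task under two hypothetical app variations.
episodes = [
    ("default", True), ("default", True), ("default", True), ("default", False),
    ("dark_theme", True), ("dark_theme", False), ("dark_theme", False), ("dark_theme", False),
]

rates = success_rate_by_variation(episodes)
spread = fluctuation(rates)
print(rates, spread)
```

A fixed-environment evaluation would report only one of these per-variation rates, which is why the spread goes unnoticed there.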
Abstract
Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect an agent's ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of…
Taxonomy
Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Social Robot Interaction and HRI
