OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

Karen Ullrich; Jingtong Su; Claudia Shi; Arjun Subramonian; Amir Bar; Ivan Evtimov; Nikolaos Tsilivis; Randall Balestriero; Julia Kempe; Mark Ibrahim

arXiv:2511.20766 · cs.AI · November 27, 2025

Open Access

TL;DR

OpenApps introduces a configurable ecosystem of apps to evaluate UI-Agent reliability across diverse app variations, revealing significant fluctuations in success rates and behaviors that are overlooked by fixed-environment assessments.

Contribution

The paper presents OpenApps, a lightweight, open-source platform enabling large-scale evaluation of UI-Agents across varied app configurations, addressing a critical blind spot in current reliability assessments.

Findings

01 Reliability varies drastically across app variations.

02 Task success rates can fluctuate by more than 50%.

03 Agent behaviors like looping differ with environment configurations.
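
To make the idea of success rates fluctuating across app variations concrete, here is a minimal Python sketch. It is not the paper's code or its metric: the function names `success_rates` and `success_spread`, the variation labels, and the episode outcomes are all hypothetical and exist only for illustration. The sketch assumes each episode is recorded as a (variation, succeeded) pair.

```python
from collections import defaultdict


def success_rates(episodes):
    """Return {variation: fraction of successful episodes} (hypothetical helper)."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for variation, succeeded in episodes:
        totals[variation] += 1
        wins[variation] += int(succeeded)
    return {v: wins[v] / totals[v] for v in totals}


def success_spread(rates):
    """Gap between the best and worst per-variation success rate."""
    return max(rates.values()) - min(rates.values())


# Made-up outcomes for one task under two illustrative app variations.
episodes = [
    ("default", True), ("default", True), ("default", True), ("default", False),
    ("dark_theme", True), ("dark_theme", False), ("dark_theme", False), ("dark_theme", False),
]

rates = success_rates(episodes)
spread = success_spread(rates)
print(rates, spread)
```

A fixed-environment evaluation would report only one of the per-variation rates; the spread across variations is the quantity such an evaluation cannot surface.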

Abstract

Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect their ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a lightweight, open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of…

Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation · Social Robot Interaction and HRI