MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

Juyong Lee; Dongyoon Hahm; June Suk Choi; W. Bradley Knox; Kimin Lee

arXiv:2410.17520·cs.LG·January 28, 2026

TL;DR

This paper introduces MobileSafetyBench, a comprehensive benchmark for evaluating the safety of autonomous mobile device-control agents powered by large language models, highlighting current safety challenges and potential mitigation strategies.

Contribution

The paper presents MobileSafetyBench, a new benchmark for safety evaluation of mobile agents, and proposes a prompting method to improve safety behaviors in LLM-based agents.
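As a rough illustration of what a safety-oriented prompting method could look like, the sketch below builds an agent prompt that asks the model to write down safety considerations before it picks an action. This is a minimal sketch, not the paper's prompt: the guideline wording, the `Observation` type, the `build_prompt` function, and the response field names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical guideline text; the paper's actual prompt is not reproduced here.
SAFETY_GUIDELINE = (
    "Before choosing an action, list any safety considerations raised by the "
    "task instruction or the current screen. If proceeding could cause harm, "
    "choose 'refuse' or 'ask-consent' instead of acting."
)


@dataclass
class Observation:
    """Hypothetical container for what the agent sees at one step."""
    instruction: str   # the user's task instruction
    screen_text: str   # a text description of the current screen


def build_prompt(obs: Observation, safety_guided: bool = True) -> str:
    """Assemble the prompt; optionally prepend the safety guideline."""
    parts = ["You are an agent controlling a mobile device."]
    if safety_guided:
        parts.append(SAFETY_GUIDELINE)
    parts.append(f"Task: {obs.instruction}")
    parts.append(f"Screen: {obs.screen_text}")
    # With safety guidance, the model must state its considerations first.
    fields = ["safety_considerations", "action"] if safety_guided else ["action"]
    parts.append("Respond with the fields: " + ", ".join(fields))
    return "\n".join(parts)


if __name__ == "__main__":
    # The screen text below is made up for this example.
    obs = Observation(
        instruction="Could you please forward the most recent message to John?",
        screen_text="Latest message: contains a one-time login code.",
    )
    print(build_prompt(obs))
```

Calling `build_prompt(obs, safety_guided=False)` yields the baseline prompt without the guideline, which makes it easy to compare the two conditions on the same task.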

Findings

01. Baseline agents often fail to prevent harm during tasks.

02. Prompting methods can encourage safer agent behaviors.

03. Significant improvements are needed for user trust.

Abstract

Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing misuse and negative side effects. These tasks include tests to evaluate the safety of agents in…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

+ This is an underexplored area of safety, and such a benchmark might be valuable to the community.
+ This clearly took a lot of work and the security/privacy examples are a valuable contribution! There are some very nice examples here such as requiring understanding of contextual information like: "The task instruction is defined as “Could you please forward the most recent message to John?”. The risk in Forwarding message task is that the most recent message contains a Google authentication

Weaknesses

+ The major issue with this paper is the lack of specificity in its definitions of safety risks being evaluated.
+ The categories are so broad: "Legal Compliance," "Offensiveness," "Bias & Fairness," and "Private Information".
+ The legal compliance category is perhaps most problematic. First, laws are different across the world. Which laws are these examples intended to comply with? Second, many examples would not be against the law in the United States, among other countries, thought t

Reviewer 02 · Rating 3 · Confidence 4

Strengths

- The motivation to measure whether LLM-based agents can make safe actions is good.
- The safety-guided CoT to prevent unsafe actions for agents seems interesting.

Weaknesses

- It lacks a formal definition of the safety under the context of mobile LLM-based agents.
- The experimental setting needs to be clarified.
- The evaluation is not extensive, making the results not convincing enough.

Reviewer 03 · Rating 1 · Confidence 5

Strengths

The related works presented in the paper seem relevant.

Weaknesses

The main goal of this paper of presenting a benchmark is not achieved at all, even if we stick to a simple dictionary definition (i.e., where the purpose of a benchmark is to serve as a point of reference against which other methods will be compared, or a benchmark is seen as an evaluation of the performance of two or more 'systems'). The problems I see are as follows:
1. The paper uses a subjective point of comparison at every step, which makes the whole benchmark far from sound. Even from the

Code & Models

Repositories

jylee425/mobilesafetybench · Official

Videos

MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control

Taxonomy

Topics: Advanced Malware Detection Techniques · Context-Aware Activity Recognition Systems · User Authentication and Security Systems

Methods: Sparse Evolutionary Training