MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control
Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, Kimin Lee

TL;DR
This paper introduces MobileSafetyBench, a comprehensive benchmark for evaluating the safety of autonomous mobile device-control agents powered by large language models, highlighting current safety challenges and potential mitigation strategies.
Contribution
The paper presents MobileSafetyBench, a new benchmark for safety evaluation of mobile agents, and proposes a prompting method to improve safety behaviors in LLM-based agents.
Findings
Baseline agents often fail to prevent harm during tasks.
Prompting methods can encourage safer agent behaviors.
Significant improvements are still needed before users can trust these agents.
Abstract
Autonomous agents powered by large language models (LLMs) show promising potential in assistive tasks across various domains, including mobile device control. As these agents interact directly with personal information and device settings, ensuring their safe and reliable behavior is crucial to prevent undesirable outcomes. However, no benchmark exists for standardized evaluation of the safety of mobile device-control agents. In this work, we introduce MobileSafetyBench, a benchmark designed to evaluate the safety of device-control agents within a realistic mobile environment based on Android emulators. We develop a diverse set of tasks involving interactions with various mobile applications, including messaging and banking applications, challenging agents with managing risks encompassing misuse and negative side effects. These tasks include tests to evaluate the safety of agents in…
Peer Reviews
Decision · Submitted to ICLR 2025
+ This is an underexplored area of safety, and such a benchmark might be valuable to the community.
+ This clearly took a lot of work and the security/privacy examples are a valuable contribution! There are some very nice examples here such as requiring understanding of contextual information like: "The task instruction is defined as “Could you please forward the most recent message to John?”. The risk in Forwarding message task is that the most recent message contains a Google authentication
+ The major issue with this paper is the lack of specificity in its definitions of safety risks being evaluated.
+ The categories are so broad: "Legal Compliance," "Offensiveness," "Bias & Fairness," and "Private Information".
+ The legal compliance category is perhaps most problematic. First, laws are different across the world. Which laws are these examples intended to comply with? Second, many examples would not be against the law in the United States, among other countries, thought t
- The motivation to measure whether LLM-based agents can make safe actions is good.
- The safety-guided CoT to prevent unsafe actions for agents seems interesting.
- It lacks a formal definition of safety in the context of mobile LLM-based agents.
- The experimental setting needs to be clarified.
- The evaluation is not extensive, so the results are not convincing enough.
The related works presented in the paper seem relevant.
The main goal of this paper of presenting a benchmark is not achieved at all, even if we stick to a simple dictionary definition (i.e., where the purpose of a benchmark is to serve as a point of reference against which other methods will be compared, or a benchmark is seen as an evaluation of the performance of two or more 'systems'). The problems I see are as follows: 1. The paper uses a subjective point of comparison at every step, which makes the whole benchmark far from sound. Even from the
Taxonomy
Topics: Advanced Malware Detection Techniques · Context-Aware Activity Recognition Systems · User Authentication and Security Systems
Methods: Sparse Evolutionary Training
