InquireMobile: Teaching VLM-based Mobile Agent to Request Human Assistance via Reinforcement Fine-Tuning

Qihang Ai; Pi Bu; Yue Cao; Yingyao Wang; Jihao Gu; Jingxuan Xing; Zekun Zhu; Wei Jiang; Zhicheng Zheng; Jun Song; Yuning Jiang

arXiv:2508.19679·cs.AI·April 28, 2026

TL;DR

InquireMobile is a reinforcement learning-based mobile agent model that actively requests human assistance at critical points, significantly improving safe interaction capabilities in real-world environments.

Contribution

The paper introduces InquireMobile, a novel model with a two-stage training strategy and pre-action reasoning, enhancing proactive inquiry and safety in VLM-based mobile agents.

Findings

01. 46.8% improvement in inquiry success rate

02. Achieves best overall success rate on InquireBench

03. Open-sources datasets, models, and evaluation code

Abstract

Recent advances in Vision-Language Models (VLMs) have enabled mobile agents to perceive and interact with real-world mobile environments based on human instructions. However, the current fully autonomous paradigm poses potential safety risks when model understanding or reasoning capabilities are insufficient. To address this challenge, we first introduce InquireBench, a comprehensive benchmark specifically designed to evaluate mobile agents' capabilities in safe interaction and proactive inquiry with users, encompassing 5 categories and 22 sub-categories, where most existing VLM-based agents demonstrate near-zero performance. In this paper, we aim to develop an interactive system that actively seeks human confirmation at critical decision points. To achieve this, we propose InquireMobile, a novel model inspired by reinforcement learning, featuring a two-stage training…
