MobileIPL: Enhancing Mobile Agents Thinking Process via Iterative Preference Learning

Kun Huang; Weikai Xu; Yuxuan Liu; Quandong Wang; Pengzhi Gao; Wei Liu; Jian Luan; Bin Wang; Bo An

arXiv:2505.12299·cs.CL·March 24, 2026

Open Access 3 Reviews

TL;DR

This paper introduces MobileIPL, a novel iterative preference learning approach that enhances mobile agent reasoning by constructing CoaT-trees, leveraging GPT-4o for diverse data, and achieving state-of-the-art results on GUI benchmarks.

Contribution

The paper proposes a new iterative preference learning method with a CoaT-tree construction, rule-based scoring, and a three-stage instruction evolution to improve mobile agent reasoning and generalization.
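To make "rule-based scoring" concrete, here is a minimal sketch of how a predicted GUI action might be scored against a reference action without a learned reward model. Everything in it is an assumption for illustration and is not taken from the paper: the action dictionary layout (`type`, `x`, `y`, `text`), the normalized screen coordinates, the `tol` distance threshold, and the 0 / 0.5 / 1 reward levels are all hypothetical.

```python
import math


def rule_reward(pred: dict, gold: dict, tol: float = 0.14) -> float:
    """Hypothetical rule-based reward for one predicted GUI action.

    Assumed scheme (not from the paper):
      0.0 -> wrong action type
      0.5 -> right action type, wrong argument
      1.0 -> right action type and matching argument
    Coordinates are assumed to be normalized to [0, 1].
    """
    if pred.get("type") != gold.get("type"):
        return 0.0
    if gold["type"] == "click":
        # A click counts as correct when it lands within `tol` of the reference point.
        dist = math.hypot(pred["x"] - gold["x"], pred["y"] - gold["y"])
        return 1.0 if dist <= tol else 0.5
    if gold["type"] == "type":
        # Typed text is compared case-insensitively, ignoring surrounding whitespace.
        same = pred.get("text", "").strip().lower() == gold["text"].strip().lower()
        return 1.0 if same else 0.5
    # Argument-free actions (e.g. "back", "home") only need the type to match.
    return 1.0
```

Because the score depends only on the final action and a reference, it can be computed for every sampled leaf without human process-level labels, which is the property the contribution statement relies on.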

Findings

01

MobileIPL outperforms strong baselines on three GUI benchmarks.

02

Achieves state-of-the-art performance across multiple Mobile GUI-Agents benchmarks.

03

Demonstrates strong generalization to out-of-domain scenarios.

Abstract

The Chain of Action-Planning Thoughts (CoaT) paradigm has been shown to improve the reasoning performance of VLM-based mobile agents in GUI tasks. However, the scarcity of diverse CoaT trajectories limits the expressiveness and generalization ability of such agents. While self-training is commonly employed to address data scarcity, existing approaches either overlook the correctness of intermediate reasoning steps or depend on expensive process-level annotations to construct process reward models (PRMs). To address these problems, we propose Iterative Preference Learning (IPL), which constructs a CoaT-tree through iterative sampling, scores leaf nodes using a rule-based reward, and backpropagates feedback to derive Thinking-level Direct Preference Optimization (T-DPO) pairs. To prevent overfitting during warm-up supervised fine-tuning, we further introduce a three-stage instruction…
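The tree-then-pairs idea in the abstract can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `ThoughtNode` class, the mean-of-children backup rule, the `margin` threshold, and the sibling-pairing strategy are all hypothetical choices made here to show the general shape of deriving preference pairs from leaf rewards.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ThoughtNode:
    """One sampled reasoning step; only leaves carry a rule-based reward."""
    text: str
    children: List["ThoughtNode"] = field(default_factory=list)
    reward: Optional[float] = None  # set on leaves only
    value: float = 0.0              # filled in by backpropagate()


def backpropagate(node: ThoughtNode) -> float:
    """Assumed backup rule: an inner node's value is the mean of its children's values."""
    if not node.children:
        node.value = float(node.reward or 0.0)
    else:
        node.value = sum(backpropagate(c) for c in node.children) / len(node.children)
    return node.value


def preference_pairs(
    node: ThoughtNode, margin: float = 0.5, prefix: Tuple[str, ...] = ()
) -> List[Tuple[Tuple[str, ...], str, str]]:
    """Collect (context, chosen, rejected) triples among siblings.

    Two siblings form a pair when their backed-up values differ by at least
    `margin`; the context is the path of thoughts leading to their parent.
    """
    context = prefix + (node.text,)
    pairs = []
    ranked = sorted(node.children, key=lambda c: c.value, reverse=True)
    for i, better in enumerate(ranked):
        for worse in ranked[i + 1:]:
            if better.value - worse.value >= margin:
                pairs.append((context, better.text, worse.text))
    for child in node.children:
        pairs.extend(preference_pairs(child, margin, context))
    return pairs


# Tiny usage example with made-up thoughts and rewards.
root = ThoughtNode("task", [
    ThoughtNode("A", [ThoughtNode("a1", reward=1.0), ThoughtNode("a2", reward=1.0)]),
    ThoughtNode("B", [ThoughtNode("b1", reward=0.0), ThoughtNode("b2", reward=1.0)]),
])
backpropagate(root)
for context, chosen, rejected in preference_pairs(root):
    print(context, chosen, rejected)
```

Each resulting triple has the form a DPO-style trainer expects: a shared context plus a preferred and a dispreferred continuation. Pairing at the level of intermediate thoughts, rather than whole trajectories, is what lets feedback from the final action reach individual reasoning steps.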

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

GUI agents have become increasingly popular, and this is an interesting and highly practical direction. The overall presentation and writing are very clear. The pipeline components, i.e., SFT data collection and preference-training annotation and filtering, are all illustrated well, which helps advance open-source GUI agent development. The results are comprehensive and convincing, especially the demonstrated OOD performance.

Weaknesses

From a novelty standpoint, the primary contributions of this work lie in building a comprehensive end-to-end pipeline, whereas most of the technical components appear to be adaptations of existing methods or relatively straightforward extensions, especially in light of the recent surge of research on agentic system design and RL-based training. I sincerely appreciate the considerable engineering effort invested in large-scale SFT data collection and RL system implementation—this is clearly valua

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- The CoaT-tree construction via iterative sampling enables fine-grained reasoning optimization without manual step annotations and seems novel.
- Experimental results on AITZ, AMEX, and AndroidControl indicate that MobileIPL outperforms previous strong baselines while using less data.

Weaknesses

- A few recent related works are not well discussed. For example, TCPO [1] proposes thought-centric preference optimization, which is similar to the Thinking-level DPO proposed in this work. TreePO [2], TreeRL [3], and SPO [4] all introduce tree-structured rollout and value backpropagation, which is similar to the iterative sampling process of MobileIPL.
- The presentation is unclear in places. For example, (1) Section 3.3 is not a complete part. It introduces Iterative Preference Learning. However, this secti

Reviewer 03 · Rating 4 · Confidence 3

Strengths

* **1. Cost-Effective and Scalable Process Improvement**
  * **Detail**: MobileIPL successfully addresses the bottleneck of requiring **expensive process-level annotations** for building Process Reward Models (PRMs). By utilizing an Iterative Preference Learning approach, the method provides a scalable and cost-efficient mechanism to automatically guide and improve the quality of the agent's intermediate reasoning steps.
* **2. Direct Focus on Intermediate Reasoni

Weaknesses

* **1. Failure to Address Root Causes of Hallucination**
  * **Detail**: The paper highlights severe errors like **Hallucinated Thought** and **Fabricated Elements**. While MobileIPL attempts to suppress these errors via preference learning, it **does not fundamentally solve** the VLM's underlying limitations in **visual grounding** and accurate internal state tracking, leaving the model susceptible to imagining non-existent states or elements.
* **2. High Domain

Taxonomy

Topics: Mobile Agent-Based Network Management · IPv6, Mobility, Handover, Networks, Security · Context-Aware Activity Recognition Systems

Methods: Co-Scale Conv-attentional Image Transformer