Generalization in Online Reinforcement Learning for Mobile Agents

Li Gu; Zihuan Jiang; Zhixiang Chi; Huan Liu; Ziqiang Wang; Yuanhao Yu; Glen Berseth; Yang Wang

arXiv:2603.07432·cs.CV·March 10, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces AndroidWorld-Generalization, a benchmark for evaluating zero-shot generalization of vision-language RL agents in mobile GUI tasks, and proposes an RL training system that improves performance on unseen instances.

Contribution

It formalizes the generalization problem in mobile GUI tasks as a CMDP, creates a new benchmark, and develops an RL training system with open-source tools for reproducibility.

Findings

01

RL agents outperform supervised baselines on unseen instances

02

Limited gains on unseen templates and apps highlight generalization challenges

03

Few-shot test-time adaptation improves performance on unseen apps
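The findings above are organized by what is new at test time: the instance, the template, or the whole app. A minimal sketch of that distinction follows, assuming each task can be described by an (app, template, instance) context. The `TaskContext` class, the `regime` function, and the example app and template names are hypothetical illustrations, not the benchmark's actual API or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskContext:
    # Hypothetical description of one task; field names are assumptions.
    app: str        # application the task runs in
    template: str   # parameterized task template
    instance: str   # concrete parameters filled into the template


def regime(test: TaskContext, train: list[TaskContext]) -> str:
    """Label a test task by the hardest level at which it is unseen."""
    train_apps = {c.app for c in train}
    train_templates = {(c.app, c.template) for c in train}
    if test.app not in train_apps:
        return "unseen_app"
    if (test.app, test.template) not in train_templates:
        return "unseen_template"
    if test not in train:
        return "unseen_instance"
    return "seen"


# Hypothetical training set with one app and one template.
train_set = [TaskContext("notes", "create_note", "title=groceries")]

labels = [
    regime(TaskContext("notes", "create_note", "title=meeting"), train_set),
    regime(TaskContext("notes", "delete_note", "title=groceries"), train_set),
    regime(TaskContext("calendar", "add_event", "day=monday"), train_set),
]
```

The checks run from app down to instance so that a task in a new app is never counted as merely a new template or instance.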

Abstract

Graphical user interface (GUI)-based mobile agents automate digital tasks on mobile devices by interpreting natural-language instructions and interacting with the screen. While recent methods apply reinforcement learning (RL) to train vision-language model (VLM) agents in interactive environments with a primary focus on performance, generalization remains underexplored due to the lack of standardized benchmarks and open-source RL systems. In this work, we formalize the problem as a Contextual Markov Decision Process (CMDP) and introduce AndroidWorld-Generalization, a benchmark with three increasingly challenging regimes for evaluating zero-shot generalization to unseen task instances, templates, and applications. We further propose an RL training system that integrates Group Relative Policy Optimization (GRPO) with a scalable rollout collection system, consisting of…
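The abstract names GRPO without describing it. As commonly presented, its central step scores each rollout relative to the other rollouts sampled for the same task, rather than against a learned value function. The sketch below shows only that normalization step, under assumptions: the function name, the `eps` term, the use of population standard deviation, and the binary success rewards are illustrative choices, and the paper's exact variant may differ.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group of rollouts for one task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    # eps avoids division by zero when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: four rollouts of one task, reward 1.0 on success.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where all rollouts succeed carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative, while a group with identical rewards yields all zeros and so contributes nothing to the policy update.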

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

* The formalization of generalization as a CMDP is nice.
* Results demonstrate that, as prior work has found, RL-based fine-tuning can improve agent performance, even in difficult generalization problems.

Weaknesses

* The intro has a ton of citations in it; these should really only be in the related work unless they are central to the core argument of the paper.
* The test sizes are very small. What is prohibiting you from generating even more held-out test data, since it's all generated anyway? Especially because in Section 5.1 you are already expanding it to even more templates for this experiment. Perhaps I am wrong about this estimate because the format of Table 1 is confusing, but computing accuracy ove

Reviewer 02 · Rating 2 · Confidence 3

Strengths

1. This paper is clearly written and easy to follow.
2. This paper studies generalization to new tasks, which is a very important question about algorithm evaluation.
3. The benchmark and training framework are going to be open-sourced.

Weaknesses

My main criticism of this paper is that I do not follow its overall logic. The paper starts with a benchmark, highlighting that it can test the model's performance on unseen tasks. However, they use the environments from AndroidWorld, which limits the novelty. Next, the paper proposes an RL training framework. A framework should facilitate the implementation of many algorithms, whereas the paper only implements GRPO. The paper claims that the agent trained under this framework

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- Novel benchmark design: **Generalization is a big issue and bottleneck under this topic.** The three-tiered generalization benchmark (Unseen Instance/Template/App) provides a systematic framework for evaluating zero-shot transfer, addressing a significant gap in existing mobile agent research
- Practical system design: The scalable rollout collection system with Docker containerization, asynchronous execution, and error recovery addresses real engineering challenges in RL for mobile environme

Weaknesses

- Limited scale: The benchmark covers only 20 applications and 116 templates, which the authors acknowledge constrains generalization evaluation and training diversity
- Poor generalization to harder regimes: The dramatic performance drop on Unseen Template (15.7%) and especially Unseen App (8.3%) suggests fundamental limitations that aren't adequately addressed
- Few-shot adaptation is underdeveloped: The test-time adaptation experiments (Section 5.1, Q3) are preliminary and don't explore imp

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning