# SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

**Authors:** Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

arXiv: 2508.20018 · 2025-08-28

## TL;DR

SWIRL is a staged workflow for interleaved reinforcement learning in multi-agent systems: it updates one agent at a time while keeping the others fixed, which stabilizes training and promotes efficient coordination in mobile GUI control and in multi-agent mathematical reasoning.

## Contribution

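SWIRL reformulates multi-agent reinforcement learning (MARL) as a sequence of single-agent reinforcement learning tasks. The paper supports this with a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, and reports superior performance on GUI control benchmarks along with strong capability in multi-agent mathematical reasoning.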
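The "one agent at a time" reformulation can be illustrated with a toy analogue. The sketch below is not the authors' training code: it replaces reinforcement learning with plain gradient ascent on a hypothetical shared objective (`joint_return`), and every name in it is an assumption made for illustration. It only shows the interleaving pattern, in which each inner update touches a single agent's parameter while the other stays frozen.

```python
from typing import Callable, List, Tuple


def joint_return(theta: List[float]) -> float:
    """Hypothetical cooperative objective shared by two scalar 'agents'."""
    a, b = theta
    return -(a - 1.0) ** 2 - (b - 2.0) ** 2 - 0.5 * (a - b + 1.0) ** 2


def update_one_agent(
    theta: List[float],
    agent: int,
    objective: Callable[[List[float]], float],
    lr: float = 0.1,
    steps: int = 20,
    eps: float = 1e-5,
) -> List[float]:
    """Improve one agent's parameter by gradient ascent; all others stay fixed."""
    theta = list(theta)
    for _ in range(steps):
        up, down = list(theta), list(theta)
        up[agent] += eps
        down[agent] -= eps
        # Central-difference estimate of the gradient w.r.t. this agent only.
        grad = (objective(up) - objective(down)) / (2 * eps)
        theta[agent] += lr * grad
    return theta


def interleaved_training(
    theta: List[float],
    objective: Callable[[List[float]], float],
    rounds: int = 5,
) -> Tuple[List[float], List[float]]:
    """Alternate single-agent updates; record the objective after each stage."""
    history = [objective(theta)]
    for _ in range(rounds):
        for agent in range(len(theta)):
            theta = update_one_agent(theta, agent, objective)
            history.append(objective(theta))
    return theta, history


if __name__ == "__main__":
    final_theta, returns = interleaved_training([0.0, 0.0], joint_return)
    print(final_theta, returns[-1])
```

On this toy objective each stage is a small-step ascent on a concave function of one variable, so the recorded objective does not decrease from stage to stage. That is only an analogy for the monotonic improvement result the paper states, not a reproduction of it.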

## Key findings

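- Reports superior performance on both high-level and low-level GUI benchmarks
- Backs stable training with a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return
- Shows strong capability beyond GUI tasks, in multi-agent mathematical reasoning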

## Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.
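The division of labor between the two agents named in the abstract can be sketched as a two-stage pipeline. This is a minimal illustration under loud assumptions: the `Plan` and `Action` types, the `Screen` dictionary of element centers, and the keyword-matching logic are all hypothetical stand-ins, since the abstract does not specify the plan format, the action space, or how either agent is implemented (in the paper both roles are played by learned models).

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical screen representation: element label -> center pixel (x, y).
Screen = Dict[str, Tuple[int, int]]


@dataclass
class Plan:
    """Hypothetical structured plan: one low-level instruction."""
    instruction: str


@dataclass
class Action:
    """Hypothetical atomic action with a kind and screen coordinates."""
    kind: str
    x: int
    y: int


def navigator(goal: str, screen: Screen) -> Plan:
    """Stand-in planner: choose the on-screen element mentioned in the goal."""
    for label in screen:
        if label.lower() in goal.lower():
            return Plan(instruction=f"tap {label}")
    return Plan(instruction="scroll down")


def interactor(plan: Plan, screen: Screen) -> Action:
    """Stand-in grounder: turn the plan into an executable atomic action."""
    verb, _, target = plan.instruction.partition(" ")
    if verb == "tap" and target in screen:
        x, y = screen[target]
        return Action("tap", x, y)
    return Action("scroll", 0, 0)


if __name__ == "__main__":
    screen = {"Settings": (120, 640), "Camera": (360, 640)}
    plan = navigator("Open the settings app", screen)
    print(plan, interactor(plan, screen))
```

Keeping planning and grounding behind separate functions mirrors the decoupling of competencies the abstract describes, and it is what makes it possible to update one role while the other is held fixed.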

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20018/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20018/full.md

## References

73 references — full list in the complete paper: https://tomesphere.com/paper/2508.20018/full.md

---
Source: https://tomesphere.com/paper/2508.20018