SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Quanfeng Lu; Zhantao Ma; Shuai Zhong; Jin Wang; Dahai Yu; Michael K. Ng; Ping Luo

arXiv:2508.20018·cs.AI·August 28, 2025

SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Quanfeng Lu, Zhantao Ma, Shuai Zhong, Jin Wang, Dahai Yu, Michael K. Ng, Ping Luo

PDF

3 Reviews

TL;DR

SWIRL introduces a staged interleaved reinforcement learning workflow for multi-agent systems, improving stability and efficiency in mobile GUI control and other multi-agent tasks.

Contribution

It reformulates multi-agent reinforcement learning into sequential single-agent tasks, providing theoretical guarantees and demonstrating superior performance in GUI control and reasoning.

Findings

01

Outperforms existing methods on GUI benchmarks

02

Ensures stable training with theoretical safety bounds

03

Effective in multi-agent mathematical reasoning

Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents.…

Figures12

Click any figure to enlarge with its caption.

Tables10

Table 1. Table 1: Comparison of actor parameters loaded during training.

Method	Actor parameters
HAPPO	$\sum_{i = 1}^{N} \| θ_{i} \|$
A2PO	$\sum_{i = 1}^{N} \| θ_{i} \|$
MARFT	$\sum_{i = 1}^{N} \| θ_{i} \|$
SWIRL (Ours)	$\max {\| θ_{i} \|}$

Table 2. Table 2: Performance comparison on the AndroidControl-High and GUIOdyssey datasets. Bold and underline indicate the best and second-best performers, respectively.

Models	Method	AndroidControl-High			GUIOdyssey			Overall
Models	Method	Type	GR	SR	Type	GR	SR	Overall
GPT-4o	Closed-Source	63.06	30.90	21.17	37.50	14.17	5.36	28.69
OS-Atlas-4B	SFT	49.01	49.51	22.77	49.63	34.63	20.25	37.63
OS-Atlas-7B	SFT	57.44	54.90	29.83	60.42	39.74	26.96	44.88
UI-R1-3B	RFT	43.97	63.99	26.30	20.93	56.35	8.65	36.70
UI-R1-E-3B	RFT	29.67	61.50	14.37	7.59	62.87	1.94	29.66
GUI-R1-3B	RFT	58.04	56.24	46.55	54.84	41.52	41.33	49.75
GUI-R1-7B	RFT	71.63	65.56	51.67	65.49	43.64	38.79	56.13
ReGUIDE-7B	RFT	–	–	50.00	–	–	–	–
GPT-5 / UI-R1-3B	Multi-Agent	55.08	73.47	37.27	62.97	62.42	35.93	54.52
GPT-5 / UI-R1-E-3B	Multi-Agent	60.36	74.79	40.65	65.28	63.41	38.22	57.12
\rowcolorgreen!12 GPT-5 / Interactor	Multi-Agent	64.68	74.62	49.53	68.77	62.70	44.21	60.75
\rowcolorvarysorange!25 Navigator / Interactor	SWIRL	66.72	71.19	51.24	74.87	66.39	51.65	63.68

Table 3. Table 3: Performance comparison on web and desktop low-level tasks. The best and second-best in each column are indicated by bold and underline , respectively.

Models	GUI-Act-Web			OmniAct-Web			OmniAct-Desktop			Overall
Models	Type	GR	SR	Type	GR	SR	Type	GR	SR	Overall
GPT-4o	77.09	45.02	41.84	79.33	42.79	34.06	79.97	63.25	50.67	57.11
OS-Atlas-4B	79.22	58.57	42.62	46.74	49.24	22.99	63.30	42.55	26.94	48.02
OS-Atlas-7B	86.95	75.61	57.02	85.63	69.35	59.15	90.24	62.87	56.73	71.51
UI-R1-3B	75.89	79.43	67.31	75.42	61.35	61.33	73.41	64.12	63.98	69.14
GUI-R1-3B	89.86	87.42	76.31	88.58	75.10	75.08	91.86	78.37	78.31	82.32
GUI-R1-7B	90.85	88.06	80.31	91.16	77.29	77.35	92.20	83.36	83.33	84.88
\rowcolorgreen!12 Interactor	95.00	87.85	84.85	94.52	81.67	77.32	94.65	77.09	72.97	85.10

Table 4. Table 4: Performance comparison on the AndroidControl-Low dataset. Bold and underline indicate the best and second-best performers, respectively.

Model	AndroidControl-Low
Model	Type	GR	SR
GPT-4o	74.33	38.67	28.39
OS-Atlas-4B	64.58	71.19	40.62
OS-Atlas-7B	73.00	73.37	50.94
UI-R1	72.49	88.48	57.37
UI-R1-E	73.91	91.91	55.58
GUI-R1-3B	83.58	81.59	64.41
GUI-R1-7B	85.17	84.02	66.52
ReGUIDE-7B	–	–	67.40
SE-GUI-7B	–	79.60	68.20
\rowcolorgreen!12 Interactor	84.62	92.20	78.81

Table 5. Table 5: Math results for training-time multi-agent frameworks. SWIRL and MARFT share the same base model Qwen2.5-Coder-3B-Instruct, and are both trained on the MATH training set.

Model	Method	Training Data	MATH500	CMATH	GSM8K	Overall
Qwen2.5-Coder-3B-Instruct	Multi-Agent	–	47.2	81.1	77.3	68.5
MARFT	Multi-Agent	MATH	49.8	83.0	78.7	70.5
SWIRL (ours)	Multi-Agent	MATH	64.6	83.5	81.4	76.5

Table 6. Table 6: Action spaces used for each dataset.

Dataset	Action Space
AITW, AMEX	click(point=‘(x1, y1)’) type(content=‘xxx’) scroll(direction=‘down\|up\|right\|left’) press_home() press_back() press_enter() finished()
AndroidControl	click(point=‘(x1, y1)’) long_press(point=‘(x1, y1)’) type(content=‘xxx’) scroll(direction=‘down\|up\|right\|left’) open_app(app_name=‘xxx’) press_home() press_back() wait() finished()
GUIOdyssey	click(point=‘(x1, y1)’) long_press(point=‘(x1, y1)’) type(content=‘xxx’) scroll(direction=‘down\|up\|right\|left’) press_home() press_back() press_appselect() error(content=‘xxx’) finished()
GUI-Act-Web	click(point=‘(x1, y1)’) scroll(direction=’down\|up’)
OmniAct-Web)	click(point=‘(x1, y1)’) rightclick(point=‘(x1, y1)’) scroll(direction=‘down\|up’)
OmniAct-Desktop	click(point=‘(x1, y1)’) rightclick(point=‘(x1, y1)’) doubleclick(point=‘(x1, y1)’) moveto(point=‘(x1, y1)’) scroll(direction=‘down\|up’)

Table 7. Table 7: Effect of interactive updates. Numbers in italics indicate performance gains relative to Stage 1 with 1 , 500 1,500 samples.

Training Strategy	AndroidControl-High			GUIOdyssey			Overall
Training Strategy	Type	GR	SR	Type	GR	SR	Overall
Stage 1 $+$ Stage 1	63.77 (+0.90)	70.25 (+0.35)	48.04 (+1.21)	72.25 (+1.52)	66.19 (+2.87)	49.64 (+3.22)	61.69 (+1.68)
Stage 1 $+$ Stage 2	66.72 (+3.85)	71.19 (+1.29)	51.24 (+4.41)	74.87 (+4.14)	66.39(+3.07)	51.65(+5.23)	63.68 (+3.67)

Table 8. Table 8: Effect of interleaved updates on Interactor performance in low-level tasks.

Training Strategy	GUI-Act-Web			OmniAct-Web			OmniAct-Desktop			AndroidControl-Low			Overall
Training Strategy	Type	GR	SR	Type	GR	SR	Type	GR	SR	Type	GR	SR	Overall
Stage 1	94.31	88.22	77.00	89.22	81.62	72.40	91.38	74.45	68.03	85.48	71.91	68.87	80.24
Stage 1 $+$ Stage 2	95.00	87.85	84.85	94.52	81.67	77.32	94.65	77.09	72.97	84.62	92.20	78.81	85.13

Table 9. Table 9: Performance of multi-Agent systems on high-level tasks with Navigators trained with and without interleaved updates.

Stage 2 Training	AndroidControl-High			GUIOdyssey			Overall
Stage 2 Training	Type	GR	SR	Type	GR	SR	Overall
✗	63.16	54.65	39.91	73.19	60.49	46.51	56.32
✓	66.72	71.19	51.24	74.87	66.39	51.65	63.68

Table 10. Table 10: Effect of the rounds-epochs schedule under a fixed training budget. Here, ‘Epochs/round’ denotes the number of epochs trained within each round. The total training epochs are held constant across settings.

Rounds	Epochs/round	AndroidControl			GUIOdyssey			Overall
Rounds	Epochs/round	Type	GR	SR	Type	GR	SR	Overall
2	10	65.74	70.42	49.88	73.13	64.85	49.27	62.21
5	4	65.26	70.94	49.81	73.85	65.21	49.98	62.51
10	2	66.00	70.91	50.56	74.03	65.71	50.58	62.96

Equations34

J(\Pi_{k,i,j+1})\ \geq\ J(\Pi_{k,i,j})\ +\ L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j+1}\big{)}-\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\,\pi^{\,i}_{k,j+1}\big{)}.

J(\Pi_{k,i,j+1})\ \geq\ J(\Pi_{k,i,j})\ +\ L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j+1}\big{)}-\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\,\pi^{\,i}_{k,j+1}\big{)}.

J (θ_{n}, θ_{i}) = E_{r \sim π (I, H_{t}, X_{t}) \sim D} k, ℓ \sum j \in {n, i} \sum \frac{I _{k, ℓ}^{(j)}}{K \sum _{ℓ = 1}^{∣ r_{k} ∣} I _{k, ℓ}^{(j)}} \cdot (clip (v_{k, ℓ}^{(j)}, A_{k}) - λ D_{KL} [π_{θ_{j}} ∥ π_{θ_{j}}^{ref}]) .

J (θ_{n}, θ_{i}) = E_{r \sim π (I, H_{t}, X_{t}) \sim D} k, ℓ \sum j \in {n, i} \sum \frac{I _{k, ℓ}^{(j)}}{K \sum _{ℓ = 1}^{∣ r_{k} ∣} I _{k, ℓ}^{(j)}} \cdot (clip (v_{k, ℓ}^{(j)}, A_{k}) - λ D_{KL} [π_{θ_{j}} ∥ π_{θ_{j}}^{ref}]) .

θ_{n}^{(k + 1)} \leftarrow R L (θ_{n} ∣ J, θ_{n}^{(k)}, θ_{i}^{(k)}, μ_{n});

θ_{n}^{(k + 1)} \leftarrow R L (θ_{n} ∣ J, θ_{n}^{(k)}, θ_{i}^{(k)}, μ_{n});

θ_{i}^{(k + 1)} \leftarrow R L (θ_{i} ∣ J, θ_{n}^{(k + 1)}, θ_{i}^{(k)}, μ_{i});

θ_{i}^{(k + 1)} \leftarrow R L (θ_{i} ∣ J, θ_{n}^{(k + 1)}, θ_{i}^{(k)}, μ_{i});

J (π) := E_{π} [t = 0 \sum \infty γ^{t} r (s_{t}, a_{t})] .

J (π) := E_{π} [t = 0 \sum \infty γ^{t} r (s_{t}, a_{t})] .

\Pi_{k,i,j}:=\big{(}\{\pi^{\,r}_{k+1}\}_{r<i},\ \pi^{\,i}_{k,j},\ \{\pi^{\,r}_{k}\}_{r>i}\big{)}=\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j}\big{)},

\Pi_{k,i,j}:=\big{(}\{\pi^{\,r}_{k+1}\}_{r<i},\ \pi^{\,i}_{k,j},\ \{\pi^{\,r}_{k}\}_{r>i}\big{)}=\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j}\big{)},

\tau^{-i}_{k}:=\big{(}\{\pi^{\,r}_{k+1}\}_{r<i},\ \{\pi^{\,r}_{k}\}_{r>i}\big{)}.

\tau^{-i}_{k}:=\big{(}\{\pi^{\,r}_{k+1}\}_{r<i},\ \{\pi^{\,r}_{k}\}_{r>i}\big{)}.

V_{\pi}(s):=\mathbb{E}_{\pi}\!\Big{[}\sum_{t\geq 0}\gamma^{t}r(s_{t},a_{t})\,\Big{|}\,s_{0}=s\Big{]},\qquad Q_{\pi}(s,a):=\mathbb{E}_{\pi}\!\Big{[}\sum_{t\geq 0}\gamma^{t}r(s_{t},a_{t})\,\Big{|}\,s_{0}=s,\,a_{0}=a\Big{]}.

V_{\pi}(s):=\mathbb{E}_{\pi}\!\Big{[}\sum_{t\geq 0}\gamma^{t}r(s_{t},a_{t})\,\Big{|}\,s_{0}=s\Big{]},\qquad Q_{\pi}(s,a):=\mathbb{E}_{\pi}\!\Big{[}\sum_{t\geq 0}\gamma^{t}r(s_{t},a_{t})\,\Big{|}\,s_{0}=s,\,a_{0}=a\Big{]}.

Q^{i}_{\pi}(s,a^{i}):=\mathbb{E}_{a^{-i}\sim\pi^{-i}(\cdot\mid s)}\big{[}\,Q_{\pi}\big{(}s,(a^{-i},a^{i})\big{)}\,\big{]},\qquad A^{\,i}_{\pi}(s,a^{i}):=Q^{i}_{\pi}(s,a^{i})-V_{\pi}(s).

Q^{i}_{\pi}(s,a^{i}):=\mathbb{E}_{a^{-i}\sim\pi^{-i}(\cdot\mid s)}\big{[}\,Q_{\pi}\big{(}s,(a^{-i},a^{i})\big{)}\,\big{]},\qquad A^{\,i}_{\pi}(s,a^{i}):=Q^{i}_{\pi}(s,a^{i})-V_{\pi}(s).

L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\hat{\pi}^{i}\big{)}:=\mathbb{E}_{\begin{subarray}{c}s\sim\rho_{\Pi_{k,i,j}}\\ a^{i}\sim\hat{\pi}^{i}(\cdot\mid s)\end{subarray}}\Big{[}\,A^{\,i}_{\Pi_{k,i,j}}(s,a^{i})\,\Big{]}.

L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\hat{\pi}^{i}\big{)}:=\mathbb{E}_{\begin{subarray}{c}s\sim\rho_{\Pi_{k,i,j}}\\ a^{i}\sim\hat{\pi}^{i}(\cdot\mid s)\end{subarray}}\Big{[}\,A^{\,i}_{\Pi_{k,i,j}}(s,a^{i})\,\Big{]}.

F_{k,i,j}(\hat{\pi}^{i})\ :=\ L^{\,i}_{\Pi_{k,i,j}}(\tau^{-i}_{k},\hat{\pi}^{i})\ -\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\hat{\pi}^{i}\big{)}.

F_{k,i,j}(\hat{\pi}^{i})\ :=\ L^{\,i}_{\Pi_{k,i,j}}(\tau^{-i}_{k},\hat{\pi}^{i})\ -\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\hat{\pi}^{i}\big{)}.

J(\Pi_{k,i,j+1})\ \geq\ J(\Pi_{k,i,j})\ +\ L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j+1}\big{)}-\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\,\pi^{\,i}_{k,j+1}\big{)}.

J(\Pi_{k,i,j+1})\ \geq\ J(\Pi_{k,i,j})\ +\ L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\,\pi^{\,i}_{k,j+1}\big{)}-\ C_{k,i,j}\,D^{\max}_{\mathrm{KL}}\!\big{(}\pi^{\,i}_{k,j},\,\pi^{\,i}_{k,j+1}\big{)}.

J(\Pi_{k,i,j+1})-J(\Pi_{k,i,j})\ \geq\ F_{k,i,j}\!\big{(}\pi^{\,i}_{k,j+1}\big{)}\ \geq\ 0

J(\Pi_{k,i,j+1})-J(\Pi_{k,i,j})\ \geq\ F_{k,i,j}\!\big{(}\pi^{\,i}_{k,j+1}\big{)}\ \geq\ 0

J(\pi_{k+1})-J(\pi_{k})=\sum_{i=1}^{n}\sum_{j=0}^{K_{i}-1}\big{[}J(\Pi_{k,i,j+1})-J(\Pi_{k,i,j})\big{]}\ \geq\ 0,

J(\pi_{k+1})-J(\pi_{k})=\sum_{i=1}^{n}\sum_{j=0}^{K_{i}-1}\big{[}J(\Pi_{k,i,j+1})-J(\Pi_{k,i,j})\big{]}\ \geq\ 0,

J (\overset{π}{ˉ}) = J (j \to \infty lim π_{k_{j}}) = j \to \infty lim J (π_{k_{j}}) = \overset{ˉ}{J} .

J (\overset{π}{ˉ}) = J (j \to \infty lim π_{k_{j}}) = j \to \infty lim J (π_{k_{j}}) = \overset{ˉ}{J} .

R L (θ ∣ J, θ_{n}, θ_{frozen}, μ) .

R L (θ ∣ J, θ_{n}, θ_{frozen}, μ) .

R L (θ ∣ J, θ_{frozen}, θ_{i}, μ),

R L (θ ∣ J, θ_{frozen}, θ_{i}, μ),

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01Rating 4Confidence 4

Strengths

Clear role decomposition (auditable plan to act chain). Practical pipeline that reuses mature single-agent RL tooling. Theoretical footing (safety bound, round-wise monotonic improvement). Solid empirical wins for low-level execution and planner replacement. Transferable beyond GUI with modest data.

Weaknesses

Training always freezes one agent (Navigator or Interactor) and optimizes the other, effectively decomposing the multi-agent problem into a sequence of single-agent GRPO/PPO/TRPO steps. This is closer to modular alternating optimization than classic concurrent cooperative MARL. As a result, non-stationarity and coordination that arise when both agents evolve simultaneously are not directly learned. The GUI evaluation relies on offline single-step accuracy (Type/GR/SR)—e.g., a click landing in t

Reviewer 02Rating 4Confidence 2

Strengths

**Impressive empirical results** - SWIRL outperforms or achieves near state-of-the-art in a variety of benchmarks. This is a major strength of the paper. **Problem is important** - Mobile GUI control is a useful area of research, so I believe the paper is well-motivated and will be of use to the community. **Transfer beyond GUI** - I appreciate that this paper extends its results to more than one benchmark, namely mathematical benchmarks. Validity of the work is improved by showing results are

Weaknesses

- **Support for $>2$ agents** - While the proposed SWIRL framework is reported to support N different agents, all experiments focus on settings with only two agents. Do you have any results demonstrating a setting with more than 2 agents? - **Walltime** - Currently, the paper does not disclose any information about the wall time of this method. I would imagine that the alternating nature of the approach may increase this substantially. - **Limitations Section?** - There appears to be no limit

Reviewer 03Rating 2Confidence 4

Strengths

- Interesting idea - Good experiments on replacing the Navigator with SOTA models such as GPT5 etc. to get Interactor performance

Weaknesses

- It is not clear to me why this is formulated as a multi-agent problem, as the setup is stated as a "...sequential single-agent... enabling reuse of standard RL..." - I believe the main weakness of the paper is the experiment design to answer the question above. In the experiment, I would have expected the following experiments: 1. End-to-end RL training of Navigator and Interactor, potentially sharing rewards 2. Isolated training of the Navigator until convergence and then isolated training o

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Full text

SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Quanfeng Lu111footnotemark: 1 , Zhantao Ma1 , Shuai Zhong1 , Jin Wang1 , Dahai Yu3 , Michael K. Ng2 , Ping Luo1

1The University of Hong Kong 2Hong Kong Baptist University

3TCL Corporate Research (Hong Kong) Co., Ltd

https://github.com/Lqf-HFNJU/SWIRL Equal ContributionCorresponding Author: [email protected]

Abstract

The rapid advancement of large vision language models (LVLMs) and agent systems has heightened interest in mobile GUI agents that can reliably translate natural language into interface operations. Existing single-agent approaches, however, remain limited by structural constraints. Although multi-agent systems naturally decouple different competencies, recent progress in multi-agent reinforcement learning (MARL) has often been hindered by inefficiency and remains incompatible with current LVLM architectures. To address these challenges, we introduce SWIRL, a staged workflow for interleaved reinforcement learning designed for multi-agent systems. SWIRL reformulates MARL into a sequence of single-agent reinforcement learning tasks, updating one agent at a time while keeping the others fixed. This formulation enables stable training and promotes efficient coordination across agents. Theoretically, we provide a stepwise safety bound, a cross-round monotonic improvement theorem, and convergence guarantees on return, ensuring robust and principled optimization. In application to mobile GUI control, SWIRL instantiates a Navigator that converts language and screen context into structured plans, and an Interactor that grounds these plans into executable atomic actions. Extensive experiments demonstrate superior performance on both high-level and low-level GUI benchmarks. Beyond GUI tasks, SWIRL also demonstrates strong capability in multi-agent mathematical reasoning, underscoring its potential as a general framework for developing efficient and robust multi-agent systems.

1 Introduction

With the rapid progress of large vision–language models (LVLMs) (OpenAI, 2025; Zhu et al., 2025; Bai et al., 2025; Guo et al., 2025), increasing attention has been devoted to mobile GUI agents capable of translating natural language instructions into reliable interface manipulation (Qin et al., 2025; Xu et al., 2024; Wu et al., 2024b; Lu et al., 2024; Luo et al., 2025). These agents ground user instructions in the current screenshot and interaction history, reason over this evolving state, and iteratively generate the next action until the task is completed. Effective mobile GUI control depends on two key competencies: task planning, which forms global, goal-conditioned decisions under evolving contexts, and task execution, which translates these plans into executable actions with precise localization. Most existing systems adopt a single-agent design, which complicates the robust integration of these competencies.

We identify two fundamental challenges. First, coupling high-level planning with fine-grained perception and precise actuation makes single end-to-end policies prone to interference and brittleness (Wang et al., 2024; Erdogan et al., 2025; Mo et al., 2025; Wang et al., 2025). Second, end-to-end systems often exhibit a weak linkage between reasoning traces and executed actions, sometimes producing correct outcomes for spurious reasons or plausible traces paired with faulty actions (Turpin et al., 2023; Arcuschin et al., 2025; Li et al., 2024a), thereby undermining safety and accountability in assurance-critical applications (Zhang et al., 2025; Shi et al., 2025; Kuntz et al., 2025).

A multi-agent design provides a principled approach to decoupling core competencies by assigning planning and execution to specialized agents. Beyond this division of labor, structured interactions between agents further enhance transparency by making the reasoning process more interpretable and the resulting actions more auditable. This explicit linkage between decision-making and execution not only improves accountability but also facilitates supervision and error analysis, which are essential for building reliable GUI control systems. However, training-free adaptation of generic LVLMs rarely suffices for domain-specific GUI control due to insufficient cooperation (Niu et al., 2024). Naive multi-agent reinforcement learning (MARL) further introduces practical challenges: joint optimization of multiple policies inflates compute budgets and limited capabilities (Wang et al., 2022; Gogineni et al., 2023). Meanwhile, high-throughput reinforcement learning (RL) frameworks developed for LVLMs are almost exclusively engineered for single-agent training (Hu et al., 2024; Sheng et al., 2025), making them ill-suited for MARL (Liao et al., 2025). These limitations motivate a central question: can we train multi-agent systems that are resource-friendly while ensuring provable stability during training?

We introduce SWIRL, a staged workflow for interleaved reinforcement learning. SWIRL decomposes multi-agent training into two phases: independent pre-warming of each module, followed by alternating optimization where one module is frozen while the other is updated. During this alternating process, we further incorporate an online reweighting mechanism to enhance training stability and accelerate convergence. SWIRL in Fig. 2(c) updates exactly one agent at a step. After several updates on one agent, it then switches to the next, turning joint optimization into sequential single-agent problems and enabling reuse of standard RL toolchains in agent training while maintaining effective coordination. Beyond practicality, we offer theoretical and system-level benefits: we establish a per-step safety bound, prove monotonic improvement across rounds, and derive a corollary for return convergence. In implementation, SWIRL requires only the currently updated agent to be resident on the training device, yielding $O(1)$ actor memory usage, smooth compatibility with standard stacks, and support for heterogeneous model sizes and update budgets, contrasting with the $O(N)$ memory usage of other methods. Table 1 details the count of actor parameters loaded during training without extra optimizations.

We instantiate SWIRL on mobile GUI control with a dual-agent architecture. The Navigator interprets instructions, interaction history, and the current screenshot to form a task context and produce structured low-level instructions (LLI). The Interactor consumes the LLI together with the current UI view and outputs atomic actions such as click, scroll, and text input with precise localization.

Fig. 1 presents the inference pipeline. Training alternates between the two agents: when optimizing the Navigator, the Interactor is fixed and executes the Navigator’s instructions to yield actions and rewards; when optimizing the Interactor, the Navigator remains fixed and supplies instructions for each step. This instantiation separates planning from execution and enforces a tight linkage between reasoning and action. With this decoupled design and the stability of interleaved updates, our system attains state-of-the-art zero-shot performance on both high-level and low-level mobile GUI benchmarks using only $3{,}500$ training examples, outperforming baselines trained under diverse SFT and RL regimes. We further apply the same interleaved recipe to a mathematics setting following prior multi-agent work (Liao et al., 2025) and observe significant gains, including a $+14.8$ improvement on MATH500.

The contributions of this paper are as follows: (i) we introduce SWIRL, a multi-agent training framework that interleaves single-agent updates and transforms MARL to a sequence of single-agent RL problems; it achieves $O(1)$ actor memory usage by loading only the currently updated agent, and accommodates heterogeneous model sizes, data schedules, and update budgets; (ii) we establish formal guarantees, including a per step safety bound, a monotonic improvement result across rounds, and convergence of returns; (iii) we instantiate SWIRL for mobile GUI control with a Navigator and an Interactor and, through extensive experiments, show stable training and state-of-the-art zero-shot performance, together with ablations that clarify when interleaving helps; and (iv) we demonstrate transferability by applying the same training recipe to a non-GUI domain (e.g., mathematics) and observe consistent gains on standard benchmarks, indicating potential for broader multi-agent applications.

2 Related Work

Reinforcement Learning for Mobile GUI Control

Reinforcement learning (RL) has recently emerged as a promising paradigm for GUI tasks. Unlike supervised fine-tuning (SFT), which requires large-scale annotated data, RL can learn effective policies from comparatively fewer samples while exhibiting stronger generalization to new tasks (Chu et al., 2025). A number of recent studies have investigated RL for GUI grounding, with research directions ranging from reward function design to policy optimization (Lu et al., 2025; Luo et al., 2025; Liu et al., 2025; Zhou et al., 2025; Yuan et al., 2025; Tang et al., 2025; Lee et al., 2025). By contrast, applications of RL to more complex mobile control scenarios that involve executing coarse-grained natural language instructions remain relatively limited (Luo et al., 2025). Furthermore, existing work has predominantly adopted single-agent settings, leaving multi-agent approaches largely unexplored.

Multi-Agent Systems Based on Large Language Models

A parallel line of research explores multi-agent systems powered by LLMs to address complex tasks (Hu et al., 2025; Xiao et al., 2024; Xiang et al., 2025; Zhao et al., 2024b; Wu et al., 2024a; Zhao et al., 2024a; Du et al., 2023). These systems typically assign specialized roles such as debating, voting, or negotiation, thereby structuring interactions and facilitating coordination without training. To enhance robustness and mitigate distribution shift (Han et al., 2024), recent studies have proposed strategies including persuasion-aware training (Stengel-Eskin et al., 2024) and iterative self-improvement via SFT (Subramaniam et al., 2025; Zhao et al., 2025). An important challenge is how to effectively train model cooperation when only limited training data is available (Tran et al., 2025).

Multi-Agent Reinforcement Learning

Works in Multi-Agent Reinforcement Learning (MARL) largely falls into value-based methods (Sunehag et al., 2017; Rashid et al., 2020) and actor-critic approaches (Chu et al., 2019; De Witt et al., 2020). A core challenge is non-stationarity, as one agent’s update changes others’ observations. Alternating optimization (e.g., A2PO (Wang et al., 2023), HARL (Zhong et al., 2024)) update agents sequentially at the step level but still face scalability issues (Canese et al., 2021; Tran et al., 2025). MARFT (Liao et al., 2025) combines MARL with LLMs for mathematical problem but suffers from gradient conflicts and parameter drift as the scale increases. It further argues that unifying MARL and LLMs is harder than addressing either alone, highlighting the need for scalable frameworks to integrate them efficiently.

3 Method

In Sec. 3.1, we introduce a theoretical multi-agent interleaved updating methodology in Alg. 1 and give the theoretical guarantees. In Sec. 3.2, we formulate the multi-agent framework for GUI navigation. Following that, Sec. 3.3 introduces SWIRL, a practical mobile GUI implementation of the multi-agent interleaved updating method, executed as a two-phase process involving warm-up and round-level alternating RL with online reweighting.

3.1 Preliminary of Interleaved Updating

The integration of MARL and LVLMs poses challenges, as each introduces unique complexities while being closely interlinked. Similar frictions have long been recognized in mathematical optimization. The Alternating Direction Method of Multipliers (ADMM) (Boyd et al., 2011) provides an effective strategy for tackling challenges formed by complex coupled problems: it decomposes an interconnected problem into manageable subproblems that are solved in turns. This divide-and-conquer approach is widely used in optimization, especially advantageous for challenging non-convex or non-differentiable objectives that can be decomposed into feasible subproblems (Glowinski, 2014; Yang et al., 2022). Consider the following constrained optimization problem $\min_{x,y\in\mathbb{R}^{n}}f(x)+g(y)\text{ s.t. }Ax+By=c$ and its augmented Lagrangian form: $\mathcal{L}_{\rho}(x,y,\lambda)=f(x)+g(y)+\lambda^{\top}(Ax+By-c)+\frac{\rho}{2}\|Ax+By-c\|^{2}$ . To solve this, ADMM performs alternating optimization via the following iterative process: $x^{k+1}:=\arg\min_{x}\mathcal{L}_{\rho}(x,y^{k},\lambda^{k});y^{k+1}:=\arg\min_{y}\mathcal{L}_{\rho}(x^{k+1},y,\lambda^{k});\lambda^{k+1}:=\lambda^{k}+Ax^{k+1}+By^{k+1}-c$ . It exemplifies problem decomposition and alternating optimization: rather than tackling a complex problem, alternating between simpler subproblems can achieve the overall objective.

From ADMM to Multi-Agent Interleaved Updating.

In practical ADMM, each subproblem is solved by an inner loop to a prescribed accuracy before the outer iteration advances. The key is that these inner steps deliver enough improvement for the outer objective to make steady progress. Therefore, we propose a multi-agent round-level interleaved updating training scheme in Alg. 1. The algorithm first independently pre-warms each agent using any single-agent method to obtain initial policies. It then performs interleaved updates: one agent is continuously optimized while the others are fixed, then sequentially switches to the next agent until all are updated, repeating this cycle to decompose MARL into a sequence of single-agent optimization tasks. This structure leads to the following findings: Proposition 1 provides a safety bound for each micro-step, Theorem 1 confirms that the return increases consistently across rounds, and Corollary 1 says the returns converge, and all policy limits attain this value. A summary of notation can be found in Appendix A.1, with complete proofs located in Appendix A.2.

Proposition 1 (Lower bound at a micro-step).

In round $k$ , if agent $\pi^{\,i}_{k,j}$ updates to $\pi^{\,i}_{k,j+1}$ , the new joint policy is $\Pi_{k,i,j+1}:=(\tau^{-i}_{k},\ \pi^{\,i}_{k,j+1})$ , and the performance satisfies

[TABLE]

Theorem 1 (Monotonic improvement).

Every micro-step updates in Alg. 1 satisfies $F_{k,i,j}(\pi^{\,i}_{k,j+1})\geq 0$ , and for all outer rounds $k$ we have $J(\pi_{k+1})\geq J(\pi_{k})$ .

Corollary 1 (Return convergence).

The sequence $\{J(\pi_{k})\}$ approaches a limit, referred to as $\bar{J}$ , and the collection of limit points from $\{\pi_{k}\}$ is non-empty. For any subsequence $\{\pi_{k_{j}}\}_{j\geq 0}$ that converges such that $\pi_{k_{j}}\rightarrow\bar{\pi}$ , it holds that $J(\bar{\pi})=\bar{J}$ .

Alg. 1 provides a methodology: in each round, a single active agent performs several micro-steps while all other agents are frozen. Since the baseline stays constant throughout these micro-steps, each step simplifies to a typical single-agent policy update. As a result, the surrogate goal $L-D_{\mathrm{KL}}^{\max}$ can be estimated using well-established single-agent techniques like TRPO (Schulman et al., 2015), PPO (Schulman et al., 2017), and GRPO (Shao et al., 2024), which apply feasible trust region or clipped KL updates. Previous studies have verified both the theoretical validity and empirical effectiveness of this approximation (Schulman et al., 2015; Zhong et al., 2024). We then apply this methodology to the GUI navigation task.

3.2 Multi-Agent Framework for GUI Control

Task Formulation.

We formulate GUI control task as a sequential decision-making problem. With a natural language instruction $I$ , the agent reviews a series of historical screenshots and actions $H_{t}=\{X_{t-\delta_{s}},\dots,X_{t-1},r^{a}_{t-\delta_{a}},\dots,r^{a}_{t-1}\}$ along with the current screenshot $X_{t}$ to craft a structured text reply $r_{t}$ at each time step $t$ . Here, $\delta_{s}$ and $\delta_{a}$ denote the counts of past screenshots and actions, respectively. This reply includes a reasoning ( $r^{r}_{t}$ ) and a low-level instruction (LLI, $r^{i}_{t}$ ) that outlines the next planned step. The LLI is then succeeded by the actual action ( $r^{a}_{t}$ ), where $r_{t}\sim\pi_{\theta_{n},\theta_{i}}(r\mid I,H_{t},X_{t}),\ \{r^{r}_{t},r^{i}_{t},r^{a}_{t}\}\in r_{t}.$ The objective is to generate the next action $r^{a}_{t}$ that complies with the given instruction $I$ . Appendix E presents illustrative examples of GUI agents completing GUI control tasks. Nevertheless, navigating GUI-based instructions introduces specific challenges: it necessitates high-level navigation to deduce the next-step instruction and detailed perception to engage with UI components, each demanding distinct skills.

Architecture and Training Objective.

To proficiently manage the complexities of GUI operations, we utilize a multi-agent system that distinctly separates the responsibilities between the Navigator ( $\pi_{\theta_{n}}$ ) and the Interactor ( $\pi_{\theta_{i}}$ ). The Navigator is responsible for high-level planning, where it interprets the natural language instructions, merges past actions with the UI views, and establishes a coherent task context with reasoning. It then creates a detailed LLI, reflecting the intended next actions, based on the reasoning process. Subsequently, the Interactor combines the LLI and the current UI view to generate concrete atomic actions, including actions like click, scroll, etc. This involves precise cursor positioning and visual interpretation to ensure accurate execution of the planned steps within the interface. The inference pipeline of the system is illustrated in Fig. 1.

Building on the system architecture above, we train the two agents with a round-level interleaved scheme (Alg. 1). In each round, we select one role, either the Navigator or the Interactor, and run multiple inner updates on its parameters while freezing the other agent. We then swap roles and repeat. Optimizing the theoretical update for an individual agent demands the calculation of advantage $A$ , surrogate $L$ , and $D_{\mathrm{KL}}^{\max}$ , which is cost-prohibitive. Consequently, we implement practical relaxations, similar to (Zhong et al., 2024), by approximating the theoretical goal using a GRPO objective (Shao et al., 2024). Concretely, we calculate single-agent improvements using group-relative advantages, denoted as $A_{k}$ , which are derived from multiple rollouts (standardized across the batch). We control intractable $D_{\mathrm{KL}}^{\max}$ with two tractable terms: clipped ratios around $\pi_{\text{old}}$ and a KL anchor to $\pi_{\text{ref}}$ , ensuring local trust-region control and curbing drift for stable. This preserves the ascent direction of the theoretical target $L-D_{\mathrm{KL}}^{\max}$ , whilst ensuring stability and efficiency. Each agent’s action is an autoregressive sequence, responses for rollouts $K$ are $\{r_{k,\ell}\}_{1\leq k\leq K}^{1\leq\ell\leq|r_{k}|}$ . We compute token-wise importance ratios aligned with GRPO, in line with (Luo et al., 2025). Each token is assigned to either the navigator or interactor by the indicator $\mathbb{I}^{(j)}_{k,\ell}=1$ if and only if $r_{k,\ell}\sim\pi_{\theta_{j}}$ (and [math] otherwise), securing agent-specific credit assignment without breaching the “freeze-the-complement” rule. In conclusion, our overarching multi-agent training objective is:

[TABLE]

The clipped surrogate is: $\operatorname{clip}(v,A)=\min\left(vA,\,\text{clip}(v,1{-}\epsilon,1{+}\epsilon)A\right),$ where the value of importance ratio is: $v^{(j)}_{k,\ell}=\frac{\pi_{\theta_{j}}(r_{k,\ell}\mid I,H_{t},X_{t},r_{k,<\ell})}{\pi^{\text{old}}_{\theta_{j}}(r_{k,\ell}\mid I,H_{t},X_{t},r_{k,<\ell})}.$ The scalar reward is composed of: $R_{k}=\alpha R_{\text{form}}+\beta R_{\text{acc}},$ where $R_{\text{form}}$ denotes the reward for format correctness (e.g., proper usage of required tags), and $R_{\text{acc}}$ is the reward for action accuracy, defined as $R_{\text{acc}}=\lambda_{1}R_{\text{act}}+\lambda_{2}R_{\text{info}}$ , where $R_{\text{act}}$ measures the correctness of the predicted action type, and $R_{\text{info}}$ quantifies the accuracy of the action parameters (e.g., the click location). Finally, the normalized advantage is computed as $A_{k}=\frac{R_{k}-\mu}{\sigma}$ , where $\mu$ and $\sigma$ are the mean and standard deviation of rewards across sampled trajectories.

3.3 SWIRL: Staged Workflow for Interleaved Reinforcement Learning

Building on the multi-agent architecture and learning objective described above, we introduce SWIRL, a Staged Workflow for Interleaved Reinforcement Learning. This approach is crafted to efficiently coordinate and enhance the performance of both the Navigator and Interactor concurrently, as demonstrated in Fig. 3 and Alg. 2. The benefits of SWIRL are listed in Appendix. B.2.

Stage 1: Warm-up initialization. The Navigator is first initialized through lightweight Chain-of-Thought (Wei et al., 2022) SFT, while the interactor is bootstrapped via initial reinforcement learning. The primary goal of this stage is to let each agent clearly learn its designated role: the navigator focuses on producing reasoning steps and LLI, whereas the interactor outputs the corresponding action. This warm-up phase establishes a robust foundation and minimizes variability at the outset.

Stage 2: Interleaved update. After warm-up, SWIRL progresses into a round-level alternating training stage. Within each round, one agent undergoes continuously optimization via reinforcement learning, while the other is kept static. This approach simplifies the intertwined multi-agent learning challenge into a series of static single-agent tasks. Therefore, it allows us to directly leverage modern single-agent RL algorithms, such as GRPO (see Alg. 3), thereby decreasing implementation complexity while preserving training stability and cooperative efficiency. In the Navigator training process, the Navigator generates a reply $r=\{r^{r},r^{i}\}$ at each step. The frozen Interactor then executes $r^{i}$ to produce the final action $r^{a}$ . Rewards are computed as described in Sec. 3.2, with the accuracy reward of $r^{a}$ serving as the Navigator’s $R_{\text{acc}}$ . To further enhance training efficiency and scalability, we deploy the Interactor on a separate server and integrate it within the reward module. The Navigator communicates with the Interactor via HTTP requests, enabling efficient reward computation during RL optimization. For Interactor training, we first use the frozen Navigator to generate low-level instructions for each training sample. The Interactor is then optimized via RL using these instructions as input, with rewards also computed according to Sec. 3.2. The training pipelines for the Navigator and Interactor are illustrated in Fig. 3b and Fig. 3c, respectively.

Online reweighting. After computing the GRPO-type advantages for each batch, we exclude those low-quality samples (Meng et al., 2025; Cui et al., 2025) by regulation $\mathcal{R}$ . These low-quality samples typically arise due to collaborator mistakes, noise, or simplicity (Shi et al., 2025). As the model’s performance increases, it might happen that the number of high-quality samples meeting our filtering standard becomes fewer than the batch size, leading to partially filled batches. To address this, we replenish the batch by randomly resampling from the remaining high-confidence instructions. This process produces reliable batches without introducing additional rollout cost, implicitly up-weights the informative samples to enhance training stability and convergence.

4 Experiment

The experimental setup is described in Sec. 4.1, and the complete experimental details are provided in Appendix C. Sec. 4.2 and Sec. 4.3 present evaluations of SWIRL’s zero-shot performance on high-level and low-level tasks, respectively. Sec. 4.4 investigates the proposed multi-agent training framework in the mathematics domain. Comprehensive ablation studies are reported in Appendix D.

4.1 Settings

Implementation Details.

We use Qwen2.5-VL-3B (Bai et al., 2025) as the base model for both the Navigator and Interactor agents. For historical context, we include only the sequence of past actions, excluding previous screenshots (i.e., $\delta_{a}=t-1$ , $\delta_{s}=0$ ). The weighting coefficients in the reward function in Sec. 3.2 are set to $\alpha=0.1$ , $\beta=0.9$ , $\lambda_{1}=0.2$ , and $\lambda_{2}=0.8$ . For online reweighting, we define the rule $\mathcal{R}$ as: $\mathcal{R}=\begin{cases}\text{keep},&0.1<\overline{R_{x}}<1,\\ \text{discard},&\text{otherwise.}\end{cases},$ where $\overline{R_{x}}$ denotes the average reward of sample $x$ across rollouts.

Training.

We collected $1,500$ and $2,000$ mobile-control samples for Stage 1 and Stage 2, respectively (see Appendix C.1 for details). In Stage 1 (warm-up), the Navigator is trained for $1$ epoch with SFT and the Interactor for $5$ epochs with GRPO (Shao et al., 2024), both using a learning rate of $1\times 10^{-6}$ , batch size $32$ , and DeepSpeed ZeRO-1 for the Navigator; the Interactor uses $5$ rollouts per sample. In Stage 2 (SWIRL alternating updates), the two agents are alternately trained for $2$ epochs each per round over 20 rounds, maintaining the same hyperparameters; the Navigator uses $8$ rollouts per sample and the Interactor $5$ . Stage 1 runs on $8\times$ NVIDIA A800 GPUs; Stage 2 uses $16\times$ A800s, with half for Interactor deployment as a vLLM-based (Kwon et al., 2023) inference service and half for training. The training framework builds on Qwen2.5-VL111https://github.com/QwenLM/Qwen2.5-VL/tree/main/qwen-vl-finetune (Bai et al., 2025) for SFT and VeRL222https://github.com/volcengine/verl (Sheng et al., 2025) for RL.

Evaluation.

We evaluate SWIRL on high-level tasks (AndroidControl-High (Li et al., 2024b), GUIOdyssey (Lu et al., 2024)) and low-level tasks (AndroidControl-Low (Li et al., 2024b), GUI-Act (Chen et al., 2024), OmniAct (Kapoor et al., 2024)). Following the previous work (Wu et al., 2024b; Luo et al., 2025), we report Type, GR, and SR in a zero-shot prompt setting to assess out-of-domain generalization. Appendix C.2 provides detailed information. All evaluations are conducted in a zero-shot prompt setting to assess the models’ out-of-domain generalization ability.

4.2 High-Level Task Zero-shot Performance

We benchmark our approach against models trained under different paradigms and observe substantial gains. As shown in Table 2, our multi-agent framework with two 3B models achieves state-of-the-art zero-shot performance, surpassing the SFT-trained OS-Atlas-7B(Wu et al., 2024b) by $18.8$ points and the RFT-trained GUI-R1-7B (Luo et al., 2025) by $7.55$ points. When GPT-5 (OpenAI, 2025) is used as a high-level to low-level planner for UI-R1-3B and UI-R1-E-3B (Lu et al., 2025), the scores increase by $17.82$ and $24.86$ points, confirming the advantage of separating planning from execution. Replacing GPT-5 with our Navigator trained using SWIRL and interleaved updates yields a further improvement of $2.93$ points, demonstrating the effectiveness of our training strategy in enhancing coordination and boosting downstream performance.

4.3 Low-Level Task Zero-shot Performance

The Interactor trained with SWIRL interleaved updates can also operate independently as a low-level task executor. Although its low-level instruction inputs during Stage 2 training are generated by the Navigator, it still achieves strong results, recording the highest SR and GR scores ( $78.81$ and $92.20$ ) and the second-highest Type score ( $84.62$ ) on the AndroidControl-Low benchmark (Li et al., 2024b), as shown in Table 4. Notably, despite all $3{,}500$ training samples across both stages originating exclusively from mobile devices, the model achieves an overall score of $85.10$ on low-level tasks in the web- and desktop-based GUI-Act (Chen et al., 2024) and OmniAct (Kapoor et al., 2024) datasets, representing the best average performance (Table 3). This remarkable cross-domain generalization underscores the robustness of our model and provides strong empirical evidence for the effectiveness of our training methodology.

4.4 Multi-Agent Training Framework in the Mathematics Domain

We adapt SWIRL to the mathematics domain to validate the generalizability and transferability of our multi-agent training framework. In line with the prior framework MARFT (Liao et al., 2025), we employ Qwen2.5-Coder-3B-Instruct (Hui et al., 2024) as the base model to construct a dual-agent architecture, which is then trained on the MATH (Hendrycks et al., 2021) training set. Implementation details are provided in Appendix C.3. As shown in Table 5, SWIRL delivers consistent improvements across all benchmarks, achieving a $14.8$ -point gain over MARFT on the in-domain MATH500 test set, and further surpassing it on the cross-domain CMATH (Wei et al., 2023) ( $83.5$ vs. $83.0$ ) and out-of-domain GSM8K (Cobbe et al., 2021) ( $81.4$ vs. $78.7$ ) datasets. These results demonstrate that SWIRL not only excels in its original GUI mobile control domain but also transfers effectively to substantially different problem settings, highlighting its robustness and broad applicability.

5 Conclusion

In this paper, we presented SWIRL, an interleaved reinforcement learning paradigm that reformulates multi-agent training into tractable single-agent updates. The central principle of SWIRL lies in its round-level alternating training strategy, where in each round one agent is continuously optimized through reinforcement learning while the others remain fixed. We provided theoretical guarantees for stable optimization and validated the effectiveness of SWIRL through extensive experiments in both mobile GUI control and multi-agent mathematical reasoning. Looking forward, we hope that SWIRL can inspire new approaches to multi-agent training in broader domains, such as finance and AI for science, where efficient coordination and reliable optimization remain critical challenges.

Appendix A Preliminary and Proofs

A.1 Preliminary with Notation

Environment and policies.

We consider a cooperative Markov game $\langle\mathcal{N},\mathcal{S},\mathcal{A},r,P,d\rangle$ with $n$ agents: $\mathcal{N}=[n]=\{1,\ldots,n\}$ . Let $\mathcal{A}:=\prod_{i\in\mathcal{N}}\mathcal{A}^{i}$ be the joint action space. For each $i\in\mathcal{N}$ , the (stochastic Markov) policy $\pi^{i}(\cdot\mid s)\in\Delta(\mathcal{A}^{i})$ together form the joint policy $\pi=(\pi^{i})_{i\in\mathcal{N}}$ , which induces the joint action distribution $\pi(a\mid s)=\prod_{i\in\mathcal{N}}\pi^{i}(a^{i}\mid s)$ for $a=(a^{i})_{i\in\mathcal{N}}$ . Here $\Delta(\cdot)$ denotes the set of probability distributions over a set. The environment transitions according to $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ , written $s_{t+1}\sim P(\cdot\mid s_{t},a_{t})$ , with initial state $s_{0}\sim d$ and joint reward $r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ . Let $\rho_{\pi}$ be the average state-visitation distribution induced by the joint policy $\pi$ . For agent-wise decomposition we can write $a=(a^{-i},a^{i})$ and denote $\pi^{-i}=(\pi^{j})_{j\neq i}$ .

Joint return $J(\pi)$ .

For discount $\gamma\in[0,1)$ and any fixed initial-state distribution,

[TABLE]

It represents the expected total reward from this point forward under the policy $\pi$ : summing the reward from each subsequent step and applying a discount factor of $\gamma^{t}$ at step $t$ . The notation $\infty$ indicates “assess each future step”; in tasks with a finite horizon, this sum naturally concludes upon task completion. The discount factor $\gamma$ dictates the temporal range: $\gamma=0$ focuses only on immediate rewards, suitable for offline tasks, where $J(\pi)=\mathbb{E}_{\pi}[r(s_{0},a_{0})]$ , whereas a higher value of $\gamma$ emphasizes rewards further in the future.

Rounds, order, and micro-steps.

Outer rounds are indexed by $k=0,1,2,\ldots$ . Agent $i$ executes a block of $K_{i}$ micro-steps indexed by $j=0,\ldots,K_{i}-1$ , and we denote a micro-step by $(k,i,j)$ . The joint policy at the start (resp. end) of round $k$ is $\pi_{k}$ (resp. $\pi_{k+1}$ ). During agent $i$ ’s block, $\pi^{\,i}_{k,j}$ is its temporary policy after $j$ micro-steps, initialized by $\pi^{\,i}_{k,0}=\pi_{k}^{i}$ ; after finishing the block, set $\pi_{k+1}^{i}:=\pi^{\,i}_{k,K_{i}}$ . Other agents are held fixed according to the rolling baseline defined next.

Rolling baseline and complement policy.

The baseline joint policy at micro-step $(k,i,j)$ is

[TABLE]

where the complement policy (all agents except $i$ ) is

[TABLE]

During micro-step $j$ of agent $i$ in the $k$ th round, agents positioned earlier in the sequence ( $r<i$ ) have already adopted their policies for round $(k\!+\!1)$ , the current agent follows its internal iterate $\pi^{\,i}_{k,j}$ , and those positioned later ( $r>i$ ) continue to use their round- $k$ policies.

Value, $Q$ , and advantages.

Define

[TABLE]

The joint advantage is $A_{\pi}(s,a):=Q_{\pi}(s,a)-V_{\pi}(s)$ .

For agent $i$ , define the marginal state–action value and the single-agent advantage by

[TABLE]

$A^{\,i}_{\pi}(s,a^{i})$ is the local (per-state/per-action) improvement if agent $i$ takes $a^{i}$ at $s$ while others act according to $\pi^{-i}$ . A basic identity we use is the zero-mean property $\mathbb{E}_{a^{i}\sim\pi^{i}(\cdot\mid s)}\!\big{[}A^{\,i}_{\pi}(s,a^{i})\big{]}=0$ for every $s$ .

Surrogate improvement.

Given the baseline $\Pi_{k,i,j}$ , the complement $\tau^{-i}_{k}$ , and a candidate policy $\hat{\pi}^{i}$ for agent $i$ , define

[TABLE]

$L$ aggregates the local signal $A^{\,i}$ into a policy-level surrogate by averaging. It satisfies the baseline-zero property $L^{\,i}_{\Pi_{k,i,j}}\!\big{(}\tau^{-i}_{k},\pi^{\,i}_{k,j}\big{)}=0$ , because $\mathbb{E}_{a^{i}\sim\pi^{\,i}_{k,j}(\cdot\mid s)}[A^{\,i}_{\Pi_{k,i,j}}(s,a^{i})]=0$ for all $s$ .

Micro-step objective.

Let $\varepsilon_{k,i,j}:=\max_{s,a}|A_{\Pi_{k,i,j}}(s,a)|$ and $C_{k,i,j}:=\tfrac{4\gamma\,\varepsilon_{k,i,j}}{(1-\gamma)^{2}}$ . Define the max conditional KL by $D^{\max}_{\mathrm{KL}}(\mu,\nu):=\sup_{s}\mathrm{KL}\big{(}\mu(\cdot\mid s)\,\|\,\nu(\cdot\mid s)\big{)}$ , and the per-micro-step surrogate

[TABLE]

$F_{k,i,j}$ scores a candidate update as “local surrogate improvement $L$ minus a KL safety penalty”. Taking the current iterate as the candidate reveals that it possesses the baseline-zero property: $F_{k,i,j}(\pi^{\,i}_{k,j})=0.$ This is because $L$ equals zero due to the zero-mean property of the single-agent advantage under the policy $\pi^{\,i}_{k,j}$ , and $D^{\max}_{\mathrm{KL}}(\mu,\mu)=0$ .

A.2 Theoretical Proofs

Proposition 1. In round $k$ , if agent $\pi^{\,i}_{k,j}$ updates to $\pi^{\,i}_{k,j+1}$ , the new joint policy is $\Pi_{k,i,j+1}:=(\tau^{-i}_{k},\ \pi^{\,i}_{k,j+1})$ , and the performance satisfies

[TABLE]

Proof of Proposition 1.

Set $\text{old}=\Pi_{k,i,j}$ and $\text{new}=\Pi_{k,i,j+1}=(\tau^{-i}_{k},\,\pi^{\,i}_{k,i,j+1})$ . Apply Lemma 6 in (Zhong et al., 2024) with $\pi=\text{old}$ and $\bar{\pi}=\text{new}$ . Because only agent $i$ changes, the multi-agent surrogate in Lemma 6 reduces to $L^{\,i}_{\Pi_{k,i,j}}(\tau^{-i}_{k},\,\pi^{\,i}_{k,j+1})$ , and the max-conditional KL reduces to $D^{\max}_{\mathrm{KL}}(\pi^{\,i}_{k,j},\,\pi^{\,i}_{k,j+1})$ . With $C_{k,i,j}=\tfrac{4\gamma\,\varepsilon_{k,i,j}}{(1-\gamma)^{2}}$ and $\varepsilon_{k,i,j}=\max_{s,a}|A_{\Pi_{k,i,j}}(s,a)|$ , this yields exactly inequality 3. ∎

Theorem 1. Every micro-step updates in Alg. 1 satisfies $F_{k,i,j}(\pi^{\,i}_{k,j+1})\geq 0$ , then for all outer rounds $k$ we have $J(\pi_{k+1})\geq J(\pi_{k})$ .

Proof of Theorem 1.

Leveraging the baseline-zero characteristic, $F_{k,i,j}(\pi^{\,i}_{k,j})=0$ , it follows that $\arg\max_{\pi^{i}}F_{k,i,j}(\pi^{i})\geq 0$ . Therefore, by Proposition 1 and the update rule,

[TABLE]

for every micro-step $(k,i,j)$ . Summing these inequalities in order over all micro-steps within round $k$ gives

[TABLE]

hence $J(\pi_{k+1})\geq J(\pi_{k})$ for all $k$ . ∎

Corollary 1. The sequence $\{J(\pi_{k})\}$ has a limit, denoted as $\bar{J}$ , and the set comprised of limit points of $\{\pi_{k}\}$ is not empty. Additionally, for any convergent subsequence $\{\pi_{k_{j}}\}_{j\geq 0}$ : $\pi_{k_{j}}\!\to\!\bar{\pi}$ , $J(\bar{\pi})=\bar{J}$ .

Proof of Corollary 1.

By Theorem 1, the performance sequence $\{J(\pi_{k})\}_{k\geq 0}$ is nondecreasing. With discount $\gamma\in[0,1)$ and bounded rewards $|r|\leq R_{\max}$ , every policy satisfies $|J(\pi)|\leq R_{\max}/(1-\gamma)$ , so $\{J(\pi_{k})\}$ is bounded above and converges to some $\bar{J}\in\mathbb{R}$ . Furthermore, as in (Zhong et al., 2024), the sequence of policies is bounded, hence it admits a convergent subsequence by the Bolzano-Weierstrass Theorem. Therefore, the set of limit points of $\{\pi_{k}\}$ is nonempty. Let $(\pi_{k_{j}})_{j\geq 0}$ be any subsequence converging to a limit policy $\bar{\pi}$ . By continuity of $J$ in $\pi$ , we have

[TABLE]

∎

Appendix B Details in Method

B.1 Illustration of Algorithm 3

In Alg. 3, $RL(\theta\mid\mathcal{J},\cdot,\cdot,\mu)$ denotes GRPO-type single agent RL based on the objective $\mathcal{J}$ defined in equation 2 and hyperparameter $\mu$ , where the first argument $\theta$ is the parameter to be updated.

When updating the navigator, the interactor parameters are frozen. In this case, the input $\theta_{n}$ serves as the initial setting for the navigator, whereas $\theta_{\text{frozen}}$ pertains to the fixed interactor:

[TABLE]

When updating the interactor, the navigator parameters are frozen. In this case, the input $\theta_{i}$ serves as the initial setting for the interactor, whereas $\theta_{\text{frozen}}$ pertains to the fixed navigator:

[TABLE]

Below, we take the case of freezing the interactor and updating the navigator as an example.

B.2 Benefits of SWIRL

Different from MARL methods such as HAPPO (Zhong et al., 2024) (Fig. 2a), A2PO (Wang et al., 2023), and MARFT (Liao et al., 2025) (Fig. 2b), our approach (Fig. 2c) offers four key advantages:

Seamless compatibility. SWIRL reformulates complex MARL tasks into an alternating sequence of simpler single-agent reinforcement learning problems. This design naturally integrates with modern distributed single-agent RL frameworks (e.g., veRL (Sheng et al., 2025), OPEN-RLHF (Hu et al., 2024)), eliminating intrusive modifications to communication protocols and enabling rapid adaptation to future efficient RL frameworks. Consequently, our divide-and-conquer alternating solution fundamentally resolves the integration challenges between MARL and contemporary LVLM-based training pipelines. 2. 2.

Resource-friendly scalability. Consider a system with $N$ agents, each with model size $|\theta_{i}|$ . Table 1 quantitatively summarizes the number of actor parameters loaded on the device during training. In practice, A2PO, HAPPO, and MARFT keep all agents resident on the training device for local rollout. Under this setting, these methods typically load all actor parameters in the training device during learning, leading to a total parameter size of $\sum_{i=1}^{N}\lvert\theta_{i}\rvert$ and memory consumption that scales linearly with the number of agents $(O(N))$ . In contrast, SWIRL only loads the currently updated agent model locally, while executing the remaining agents as “Model-as-a-Service” modules remotely. As a result, the memory usage of the training device remains constant regardless of $N$ (i.e., $O(1)$ ), greatly reducing hardware requirements and improving scalability for large-scale multi-agent systems. 3. 3.

Adaptation to heterogeneity. Our method allows agents to adopt diverse model architectures and training configurations, such as different numbers of training steps or heterogeneous datasets. Because SWIRL decomposes multi-agent training into per-agent optimization, it enables online, agent-conditioned reweighting that adapts to each agent’s distribution by filtering samples that are low-quality for that agent. As a result, heterogeneous agents receive tailored, high-confidence training signals; experiments on Appendix D.5 empirically confirms reduced uninformative or misleading samples and improved overall training performance. 4. 4.

Stability. The alternating update methodology efficiently addresses the non-stationary challenges present in concurrent multi-agent training, thereby minimizing policy misalignment and distributional shift, as shown in Appendix D.2 and D.5. Experimental results in Fig. 4 consistently demonstrate stable performance enhancements with each alternating phase, thus facilitating effective coordination among multiple agents.

Appendix C Experiment Details

C.1 Training Data

Training Data Collection.

We construct our two-stage training dataset based on the AITW(Rawles et al., 2023) (using the expanded version from AITZ(Zhang et al., 2024)) and AMEX(Chai et al., 2024) (adopting the Aguvis variant(Xu et al., 2024)), as both provide richer low-level annotations. In Stage 1, we leverage Qwen2.5-VL-3B to generate $8$ rollouts for each low-level instruction, then select $1,500$ samples whose average reward falls within the range $[0.3,0.4]$ . These high-quality samples are used in two ways: (1) the low-level instruction data is employed as RL training data for the Interactor, and (2) the reasoning traces associated with each sample are used to construct Chain-of-Thought supervised fine-tuning data with high-level instructions for the Navigator. In Stage 2, we utilize the multi-agent system trained in Stage 1 to generate $8$ rollouts for each high-level instruction, filtering for $2,000$ samples with an average reward below $0.6$ and variance more than $0.175$ . Notably, samples in Stage 2 include only high-level instructions and do not contain additional semantic annotations. In total, we curate $3,500$ high-quality training samples across both stages, supporting robust chain-of-thought reasoning and effective RL optimization for both agents.

Action Space.

Following the action space design style of UI-Tars (Qin et al., 2025), we define dataset-specific action spaces for different GUI benchmarks, as shown in Table 6.

Prompt.

Fig. 8 illustrates the prompts used for the Navigator and Interactor.

C.2 Evaluation Details

Metrics.

Following prior work (Wu et al., 2024b; Luo et al., 2025), we evaluate our models using three standard metrics for GUI agents: Type, GR, and SR, which assess the accuracy of action type prediction, coordinate prediction, and step success rate, respectively. Type measures the exact match between the predicted action type (e.g., click, scroll) and the ground truth. GR evaluates the performance of GUI grounding in downstream tasks. SR (step-wise success rate) is computed by considering a step correct only if both the predicted action type and all associated arguments (e.g., coordinates for a click action) match the ground truth. For click-based actions (e.g., click, long_press), the model must predict both the action type and the target coordinates (x, y). When the ground-truth bounding box is available, a prediction is considered correct if its coordinates fall within the box. If no bounding box is provided, or the prediction lies outside it, correctness is determined by whether the predicted coordinates are within $14\%$ of the screen width from the ground-truth position. For type-based actions (e.g., type, open_app), both the action type and content must match the ground truth. We compute the F1 score between the predicted and reference text, and an action is considered correct if $\text{F1}>0.5$ . For scroll actions, the predicted direction argument (up, down, left, or right) must exactly match the ground truth. For all other actions (e.g., press_enter), correctness requires an exact match between the predicted action and the ground truth.

Baselines.

To ensure a fair comparison of cross-domain generalization, we select baseline models that have not been trained on the corresponding training sets of the evaluation domains. For high-level tasks, the comparison includes four categories of models: the proprietary GPT-4o(OpenAI, 2024); the state-of-the-art OS-Atlas-4B/7B(Wu et al., 2024b), trained via supervised fine-tuning (SFT) on large-scale GUI grounding datasets; models trained with reinforcement fine-tuning (RFT) on GUI grounding datasets, including UI-R1-3B, UI-R1-E-3B(Lu et al., 2025), GUI-R1-3B/7B(Luo et al., 2025), and ReGUIDE-7B(Lee et al., 2025); and a multi-agent approach employing GPT-5(OpenAI, 2025) as the planner to translate high-level instructions into low-level actions. For low-level tasks, the baselines consist of GPT-4o, OS-Atlas-4B/7B, and RFT-trained GUI grounding models, including UI-R1-3B, UI-R1-E-3B, GUI-R1-3B/7B, ReGUIDE-7B, and SE-GUI-7B(Yuan et al., 2025).

Evaluation Datasets.

AndroidControl(Li et al., 2024b) is a mobile control dataset in which each GUI interaction trajectory is annotated with both coarse-grained high-level instructions and fine-grained low-level instructions, along with detailed XML metadata from which the bounding boxes of individual UI elements can be parsed. For click actions, we iterate over all bounding boxes in the XML, identify those containing the ground-truth coordinates, and select the smallest one by area as the candidate bounding box for click action evaluation. GUIOdyssey(Lu et al., 2024) consists of tasks involving cross-application operations, posing a significant challenge for models’ planning abilities. In its latest release333https://huggingface.co/datasets/hflqf88888/GUIOdyssey, each click action is supplemented with the bounding box of the corresponding UI element obtained via SAM2(Ravi et al., 2024) segmentation, which we also use for click action evaluation. Notably, the latest version contains $8{,}834$ samples, compared to $7{,}735$ in the earlier release444https://huggingface.co/datasets/OpenGVLab/GUIOdyssey, with changes in the test set size. To ensure consistency with other baselines, we use the Test-Random split from the earlier version. GUI-Act-Web(Chen et al., 2024) contains web-based interaction data. To better assess GUI manipulation capabilities, we remove several QA-style samples from the test set and retain the remaining samples for evaluation. Bounding boxes are not used for click action evaluation in this dataset. OmniAct(Kapoor et al., 2024) includes both web and desktop interaction data. Due to action space compatibility constraints, we filter out samples whose original action space involves hotkeys and keep the remaining samples for evaluation. Similarly, bounding boxes are not used for click action evaluation in this dataset.

C.3 Mathematics Domains

Implementation Details.

We design a dual-agent framework in which one agent acts as the Teacher, providing a concise outline of the problem-solving approach, and the other acts as the Student, generating the final solution by incorporating the Teacher’s guidance. The detailed prompt formulations for both agents are provided in Fig. 9. Both agents are initialized from the Qwen2.5-Coder-3B-Instruct(Hui et al., 2024) model. For the reward function, we assign a reward of $1$ if the generated answer is correct and [math] otherwise, i.e., $R_{x}=\begin{cases}1,&\text{if answer is correct},\\ 0,&\text{otherwise}.\end{cases}$ . For online reweighting, we define the rule $\mathcal{R}$ as follows: we retain samples whose average reward across multiple rollouts lies strictly between $0.2$ and $0.8$ , i.e., $\mathcal{R}=\begin{cases}\text{keep},&0.2<\overline{R_{x}}<0.8,\\ \text{discard},&\text{otherwise.}\end{cases},$ where $\overline{R_{x}}$ denotes the mean reward of sample $x$ over all rollouts.

Training.

We train the dual-agent framework on the MATH(Hendrycks et al., 2021) training set, which contains $7,500$ samples, by directly initiating Stage 2 of the SWIRL alternating-update procedure and bypassing the warm-up stage. In each round, the Teacher is updated first, followed by the Student, with each agent trained for $1$ epoch per round over a total of $10$ rounds. Both agents use a batch size of $128$ and a learning rate of $1\times 10^{-6}$ . The Teacher and Student perform $4$ and $8$ rollouts per sample, respectively. Training is conducted on $16\times$ NVIDIA A800 GPUs, with half allocated to deploying the Student as a vLLM-based (Kwon et al., 2023) inference service and the remaining half for model training.

Appendix D Ablation Study

D.1 Effect of Interleaved Update

To isolate the effect of SWIRL’s second training stage (interleaved updates), we compare two models trained on the same total dataset of $3,500$ samples. Both start from the Stage 1 warm-up with $1,500$ samples. The first continues training on the remaining $2,000$ samples using the Stage 1 strategy until convergence, while the second proceeds with our Stage 2 alternating update scheme. The only difference between them is the training strategy applied in the second stage. As shown in Table 7, although extended warm-up training with more data yields performance gains, its upper bound remains lower than that achieved by our interleaved update approach ( $61.69$ vs. $63.68$ ), highlighting the latter’s effectiveness.

D.2 Co-evolution of Navigator and Interactor

We investigate the impact of the proposed interleaved update on individual agents within the multi-agent framework. For the Interactor, Table 8 compares its performance on low-level tasks with and without Stage 2 interleaved updates. The results show a substantial improvement of $4.89$ points with Stage 2, particularly in SR (a metric that directly measures the correctness of individual actions and thus serves as a more critical indicator of the Interactor’s precision), which increases by nearly $7$ points. For the Navigator, direct evaluation is less straightforward because its outputs are low-level instructions rather than executable actions. To address this, we adopt an indirect evaluation approach: we feed the Navigator’s outputs into a fully trained Interactor and assess the resulting task performance. As shown in Table 9, pairing the Interactor with a Navigator trained using Stage 2 interleaved updates delivers consistently higher performance on high-level tasks, raising the overall score from $56.32$ to $63.68$ and yielding notable improvements in both GR and SR across benchmarks. This demonstrates that the updated Navigator produces more detailed and accurate low-level instructions from high-level goals. Overall, these findings provide strong evidence that interleaved updates effectively enhance each agent’s ability to fulfill its specific role, enabling them to co-evolve and achieve better collaborative performance within the multi-agent system.

D.3 Stability and Potential

As illustrated in Fig. 4, even with only $2,000$ samples used during the SWIRL’s stage 2, the proposed training paradigm consistently enhances the model’s generalization capability. The model’s performance improves steadily with each training round, demonstrating the stability of the alternating optimization process. Furthermore, its out-of-domain generalization continues to increase in the later rounds, suggesting that the performance upper bound has not yet been reached. These results highlight the effectiveness and scalability of SWIRL for robust multi-agent training in GUI navigation tasks.

D.4 Synergy and Individual Competence in Multi-Agent Training

In multi-agent training, overall system performance depends not only on enhancing the capabilities of individual agents but also on achieving effective coordination among them. To disentangle and quantify the relative contributions of these two factors, we vary the number of alternating training rounds (i.e., the frequency and intensity of inter-agent coordination) and the number of epochs per round (i.e., the depth of single-agent training), while keeping the total number of training epochs constant. As shown in Fig.5(a) and Table 10, increasing the number of alternating rounds, thereby providing more opportunities for inter-agent interaction and iterative policy refinement, consistently leads to greater performance gains than allocating the same budget to deeper single-agent training within fewer rounds. At the same time, Fig. 5(b) indicates that increasing the training depth per round still brings incremental benefits, demonstrating that single-agent optimization remains valuable. Taken together, these results suggest that, under a fixed training budget, inter-agent coordination is the primary driver of performance improvement, while individual agent refinement plays a complementary yet meaningful role.

D.5 Effectiveness Analysis of Online Reweighting

To assess the impact of the online reweighting mechanism, we conduct comparative experiments with and without weighted resampling. As shown in Fig. 7(a), models trained without online reweighting perform significantly worse and even exhibit a degradation trend, underscoring the necessity of this mechanism. We hypothesize that online reweighting dynamically prioritizes informative samples, ensuring that they exert greater influence during training, which is essential for the stable improvement of multi-agent systems. Further analysis, as illustrated in Fig. 6, reveals two important findings. First, the Interactor filters out approximately four times as many samples as the Navigator ( $\sim$$75$ vs. $\sim$$18$ ), indicating a substantial difference in how the same dataset contributes to the learning process of each agent. The online reweighting strategy thus adapts to the evolving capabilities of each agent, assigning dynamic weights to those training samples most beneficial at each stage. Second, it is important to recognize that not all samples filtered for being completely incorrect are attributable solely to the deficiencies of a single agent. Some errors may arise from noisy data or from mistakes originating with another agent (e.g., if the Planner generates an erroneous instruction, the Executor may repeatedly fail to execute the correct action). In these cases, the online reweighting mechanism helps to exclude uninformative or misleading samples, thereby further improving the robustness and effectiveness of the training process.

D.6 Sequential Updates versus Parallel Updates

Our default SWIRL implementation employs a strictly sequential update scheme, in which the Navigator is updated first in each training round, followed by the Interactor. We also explore a parallel update strategy, where both agents are updated simultaneously within each round, rather than following a fixed order, i.e., $\left\{\begin{subarray}{c}\theta_{n}^{(k+1)}\leftarrow RL({\theta_{n}}|\mathcal{J},\theta_{n}^{(k)},\theta_{i}^{(k)},\mu_{n})\\ \theta_{i}^{(k+1)}\leftarrow RL({\theta_{i}}|\mathcal{J},\theta_{n}^{(k)},\theta_{i}^{(k)},\mu_{i})\end{subarray}\right.$ . This approach aligns with the update mechanism of Jacobian ADMM (Yang et al., 2022), which relaxes the sequential constraints in standard ADMM and allows for simultaneous updates, i.e., $\left\{\begin{subarray}{c}x^{k+1}:=\arg\min_{x}\mathcal{L}_{\rho}(x,y^{k},\lambda^{k})\\ y^{k+1}:=\arg\min_{y}\mathcal{L}_{\rho}(x^{k},y,\lambda^{k})\end{subarray}\right.$ , thereby increasing training efficiency. As shown in Fig. 7(b), parallel updates can achieve performance comparable to or surpassing that of sequential updates, even without strict alternation. This finding suggests that relaxing the sequential constraint in alternating multi-agent training can improve efficiency without sacrificing model quality.

Appendix E Examples

Bibliography73

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Arcuschin et al. (2025) Iván Arcuschin, Jett Janiak, Robert Krzyzanowski, Senthooran Rajamanoharan, Neel Nanda, and Arthur Conmy. Chain-of-thought reasoning in the wild is not always faithful. ar Xiv preprint ar Xiv:2503.08679 , 2025.
2Bai et al. (2025) Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen 2.5-vl technical report. ar Xiv preprint ar Xiv:2502.13923 , 2025.
3Boyd et al. (2011) Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning , 3(1):1–122, 2011.
4Canese et al. (2021) Lorenzo Canese, Gian Carlo Cardarilli, Luca Di Nunzio, Rocco Fazzolari, Daniele Giardino, Marco Re, and Sergio Spanò. Multi-agent reinforcement learning: A review of challenges and applications. Applied Sciences , 11(11):4948, 2021.
5Chai et al. (2024) Yuxiang Chai, Siyuan Huang, Yazhe Niu, Han Xiao, Liang Liu, Dingyu Zhang, Peng Gao, Shuai Ren, and Hongsheng Li. Amex: Android multi-annotation expo dataset for mobile gui agents. ar Xiv preprint ar Xiv:2407.17490 , 2024.
6Chen et al. (2024) Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile gui agents. ar Xiv preprint ar Xiv:2406.11317 , 2024.
7Chu et al. (2019) Tianshu Chu, Jie Wang, Lara Codecà, and Zhaojian Li. Multi-agent deep reinforcement learning for large-scale traffic signal control. IEEE transactions on intelligent transportation systems , 21(3):1086–1095, 2019.
8Chu et al. (2025) Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, and Yi Ma. Sft memorizes, rl generalizes: A comparative study of foundation model post-training. ar Xiv preprint ar Xiv:2501.17161 , 2025.

TL;DR

Contribution

Findings

Abstract

Peer Reviews

Videos

SWIRL: A Staged Workflow for Interleaved Reinforcement Learning in Mobile GUI Control

Abstract

1 Introduction

2 Related Work

Reinforcement Learning for Mobile GUI Control

Multi-Agent Systems Based on Large Language Models

Multi-Agent Reinforcement Learning

3 Method

3.1 Preliminary of Interleaved Updating

From ADMM to Multi-Agent Interleaved Updating.

Proposition 1** (Lower bound at a micro-step).**

Theorem 1** (Monotonic improvement).**

Corollary 1** (Return convergence).**

3.2 Multi-Agent Framework for GUI Control

Task Formulation.

Architecture and Training Objective.

3.3 SWIRL: Staged Workflow for Interleaved Reinforcement Learning

4 Experiment

4.1 Settings

Implementation Details.

Training.

Evaluation.

4.2 High-Level Task Zero-shot Performance

4.3 Low-Level Task Zero-shot Performance

4.4 Multi-Agent Training Framework in the Mathematics Domain

5 Conclusion

Appendix A Preliminary and Proofs

A.1 Preliminary with Notation

Environment and policies.

Joint return J(π)J(\pi)J(π).

Rounds, order, and micro-steps.

Rolling baseline and complement policy.

Value, QQQ, and advantages.

Surrogate improvement.

Micro-step objective.

A.2 Theoretical Proofs

Proof of Proposition 1.

Proof of Theorem 1.

Proof of Corollary 1.

Appendix B Details in Method

B.1 Illustration of Algorithm 3

B.2 Benefits of SWIRL

Appendix C Experiment Details

C.1 Training Data

Training Data Collection.

Action Space.

Prompt.

C.2 Evaluation Details

Metrics.

Baselines.

Evaluation Datasets.

C.3 Mathematics Domains

Implementation Details.

Training.

Appendix D Ablation Study

D.1 Effect of Interleaved Update

D.2 Co-evolution of Navigator and Interactor

D.3 Stability and Potential

D.4 Synergy and Individual Competence in Multi-Agent Training

D.5 Effectiveness Analysis of Online Reweighting

D.6 Sequential Updates versus Parallel Updates

Appendix E Examples

Proposition 1 (Lower bound at a micro-step).

Theorem 1 (Monotonic improvement).

Corollary 1 (Return convergence).

Joint return $J(\pi)$ .

Value, $Q$ , and advantages.