DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar

TL;DR
DigiRL introduces a novel autonomous reinforcement learning method to train device control agents in real-world GUIs, significantly outperforming previous supervised and RL approaches by leveraging a two-stage offline and online training process.
Contribution
The paper presents DigiRL, a new two-stage RL framework that fine-tunes pre-trained vision-language models for in-the-wild device control, addressing real-world stochasticity and non-stationarity.
Findings
Achieved a 67.2% success rate on Android-in-the-Wild (AitW) tasks, a 49.5% absolute improvement (up from 17.7%) over supervised fine-tuning on static human demonstration data.
Outperformed GPT-4V, CogAgent, and previous RL approaches in success rate.
Established a new state-of-the-art for in-the-wild device control agents.
Abstract
Training corpuses for vision language models (VLMs) typically lack sufficient amounts of decision-centric data. This renders off-the-shelf VLMs sub-optimal for decision-making tasks such as in-the-wild device control through graphical user interfaces (GUIs). While training with static demonstrations has shown some promise, we show that such methods fall short for controlling real GUIs due to their failure to deal with real-world stochasticity and non-stationarity not captured in static observational data. This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents through fine-tuning a pre-trained VLM in two stages: offline RL to initialize the model, followed by offline-to-online RL. To do this, we build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator and develop a simple yet…
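The two-stage schedule described in the abstract (offline RL to initialize the policy, then offline-to-online RL driven by an automatic evaluator) can be illustrated with a toy sketch. This is not the DigiRL algorithm: the action set, the `evaluator` stand-in for the VLM-based evaluator, and the filtered-imitation update are all hypothetical simplifications chosen only to show the control flow.

```python
import math
import random

# Hypothetical action set for a single-step toy "device control" task.
ACTIONS = ["tap_search", "tap_back", "type_query"]


def evaluator(action):
    """Stand-in for a VLM-based evaluator: returns a binary success reward."""
    return 1.0 if action == "tap_search" else 0.0


def policy_probs(prefs):
    """Softmax over per-action preferences (numerically stabilized)."""
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def sample(prefs, rng):
    """Draw one action from the current policy."""
    probs = policy_probs(prefs)
    return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]


def update(prefs, batch, lr=0.5):
    """Filtered imitation (a simplification, not the paper's update rule):
    take a log-likelihood gradient step only on successful actions."""
    for action, reward in batch:
        if reward > 0:
            probs = policy_probs(prefs)
            for a in ACTIONS:
                grad = (1.0 if a == action else 0.0) - probs[a]
                prefs[a] += lr * grad


def train(offline_data, online_rounds=20, rollouts_per_round=8, seed=0):
    rng = random.Random(seed)
    prefs = {a: 0.0 for a in ACTIONS}

    # Stage 1: offline training on a fixed dataset of (action, reward) pairs.
    update(prefs, offline_data)
    offline_probs = policy_probs(prefs)

    # Stage 2: continue from the offline-initialized policy, collecting
    # fresh rollouts and scoring them with the evaluator.
    for _ in range(online_rounds):
        fresh = []
        for _ in range(rollouts_per_round):
            action = sample(prefs, rng)
            fresh.append((action, evaluator(action)))
        update(prefs, fresh)

    return offline_probs, policy_probs(prefs)


if __name__ == "__main__":
    data = [("tap_search", 1.0), ("tap_back", 0.0), ("type_query", 0.0)]
    before, after = train(data)
    print("after offline stage:", before)
    print("after online stage: ", after)
```

In this toy setting the online stage only ever reinforces actions the evaluator marks as successful, so the probability of the successful action cannot decrease relative to the offline-initialized policy.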
Taxonomy
Topics: Data Stream Mining Techniques · Reinforcement Learning in Robotics
