TL;DR
ScreenExplorer is a vision-language model (VLM) trained with Group Relative Policy Optimization (GRPO) and a world-model-based curiosity reward so that it explores and generalizes better in open, dynamic GUI environments.
Contribution
Introduces ScreenExplorer, a VLM trained with Group Relative Policy Optimization and a world-model-based curiosity reward for enhanced exploration in open GUI worlds.
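As a rough illustration of the group-relative idea behind GRPO, the sketch below scores each sampled rollout against the mean and standard deviation of its own group instead of against a learned value function. This is a minimal sketch, not the paper's training code: the function name, the `eps` constant, the use of the population standard deviation, and the example rewards are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for several rollouts sampled from the same state.
    eps: small constant (assumed value) that avoids division by zero
         when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four rollouts from the same starting screen.
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that beat their group's average get a positive advantage and the rest get a negative one, so the policy update needs no separate critic.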
Findings
Better environmental adaptation compared to static models
Enhanced exploration capabilities through experience distillation
Scalable approach toward self-improving AGI in complex settings
Abstract
The rapid progress of large language models (LLMs) has sparked growing interest in building Artificial General Intelligence (AGI) within Graphical User Interface (GUI) environments. However, existing GUI agents based on LLMs or vision-language models (VLMs) often fail to generalize to novel environments and rely heavily on manually curated, diverse datasets. To overcome these limitations, we introduce ScreenExplorer, a VLM trained via Group Relative Policy Optimization (GRPO) in real, dynamic, and open-ended GUI environments. We also introduce a world-model-based curiosity reward function to help the agent overcome the cold-start phase of exploration. Additionally, distilling experience streams further enhances the model's exploration capabilities. Our training framework enhances model exploration in open GUI environments, with trained models showing better environmental…
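The abstract does not say how the curiosity reward is computed. One common way to build such a reward from a world model is to use its prediction error: transitions the model predicts poorly are treated as novel and rewarded more. The sketch below shows that generic idea only and is not the paper's reward function. The function name and the three-dimensional "screen feature" vectors are hypothetical.

```python
def curiosity_reward(predicted_next, actual_next):
    """Mean squared error between predicted and observed next-state features.

    predicted_next: features the (hypothetical) world model expected to see.
    actual_next: features actually observed after the action was taken.
    """
    if len(predicted_next) != len(actual_next):
        raise ValueError("feature vectors must have the same length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted_next, actual_next))
    return sum(squared_errors) / len(actual_next)


# Hypothetical screen features: a well-predicted and a surprising transition.
familiar = curiosity_reward([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
novel = curiosity_reward([0.1, 0.2, 0.3], [0.9, 0.0, 0.7])
```

Under this assumption, an action that leads to an already well-modeled screen earns no reward, while one that leads to an unexpected screen earns a positive one. That gives the agent a learning signal even before any task reward is available.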
