OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models
Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Qiushi Sun, Zhaoyang Liu, Zhoumianze Liu, Yu Qiao, Xiangyu Yue, Zun Wang, Zichen Ding

TL;DR
OS-Oracle introduces a comprehensive framework with a new dataset, training paradigm, and benchmark for developing and evaluating cross-platform GUI critic models, significantly advancing step-level decision-making in GUI navigation agents.
Contribution
The paper presents a scalable data pipeline, a novel two-stage training method, and a holistic benchmark for GUI critic models, addressing data scarcity and evaluation challenges.
Findings
OS-Oracle-7B achieves state-of-the-art results on OS-Critic Bench.
The critic model surpasses proprietary models in the mobile domain.
Using OS-Oracle-7B as a pre-critic improves native GUI agent performance.
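The pre-critic finding can be pictured as a gating loop. The sketch below assumes that "pre-critic" means judging each proposed action before it is executed, as the abstract describes. It is not the paper's implementation: `Action`, `select_action`, `toy_propose`, `toy_critic`, and the retry budget are hypothetical names chosen for illustration, and the toy functions stand in for what would be VLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    kind: str    # e.g. "click" or "type"
    target: str  # hypothetical description of the UI element


# Hypothetical signatures; a real agent and critic would call a VLM.
Proposer = Callable[[str, str, List[Action]], Action]  # (task, screenshot, rejected) -> action
Critic = Callable[[str, str, Action], bool]            # (task, screenshot, action) -> approve?


def select_action(task: str, screenshot: str, propose: Proposer,
                  critic: Critic, max_retries: int = 3) -> Optional[Action]:
    """Return the first proposed action the critic approves, or None."""
    rejected: List[Action] = []
    for _ in range(max_retries):
        action = propose(task, screenshot, rejected)
        if critic(task, screenshot, action):
            return action          # approved: safe to execute
        rejected.append(action)    # feed the rejection back to the agent
    return None                    # budget exhausted: execute nothing


# Toy stand-ins so the loop can be run without any model.
CANDIDATES = [Action("click", "Delete account"), Action("click", "Save")]


def toy_propose(task: str, screenshot: str, rejected: List[Action]) -> Action:
    # Propose the first candidate that has not been rejected yet.
    for candidate in CANDIDATES:
        if candidate not in rejected:
            return candidate
    return CANDIDATES[-1]


def toy_critic(task: str, screenshot: str, action: Action) -> bool:
    # Reject anything that looks irreversible in this toy setting.
    return "Delete" not in action.target


chosen = select_action("save the form", "screen.png", toy_propose, toy_critic)
```

Returning `None` when the retry budget runs out is one possible design choice: for irreversible actions, executing nothing may be safer than falling back to an unapproved step.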
Abstract
With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle, which makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative…
Taxonomy
Topics: Advanced Software Engineering Methodologies · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning
