OS-Oracle: A Comprehensive Framework for Cross-Platform GUI Critic Models

Zhenyu Wu; Jingjing Xie; Zehao Li; Bowen Yang; Qiushi Sun; Zhaoyang Liu; Zhoumianze Liu; Yu Qiao; Xiangyu Yue; Zun Wang; Zichen Ding

arXiv:2512.16295 · cs.AI · December 19, 2025

TL;DR

OS-Oracle is a framework for developing and evaluating cross-platform GUI critic models. It combines a new dataset, a training paradigm, and a benchmark, with the aim of improving step-level decision-making in GUI navigation agents.

Contribution

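The paper presents a scalable data pipeline for synthesizing cross-platform GUI critic data, a two-stage training paradigm that begins with supervised fine-tuning (SFT), and a benchmark for GUI critic models. Together these address the scarcity of diverse, high-quality GUI feedback data and the lack of public benchmarks for step-level evaluation.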

Findings

01 OS-Oracle-7B achieves state-of-the-art results on OS-Critic Bench.

02 The critic model surpasses proprietary models in the mobile domain.

03 Used as a pre-critic, which assesses each action before execution, OS-Oracle-7B improves the performance of native GUI agents.
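
As a rough illustration of the pre-critic setting in the last finding, the sketch below gates each proposed action behind a critic verdict before anything is executed. Everything in it is assumed for illustration: the `Action` type, the `propose`, `critic`, and `execute` callables, the toy stand-ins, and the retry budget are hypothetical and do not reflect the paper's interface or the OS-Oracle-7B model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical GUI action, e.g. kind='click', target='Save'."""
    kind: str
    target: str


def run_step(
    propose: Callable[[str, List[Action]], Action],
    critic: Callable[[str, Action], bool],
    execute: Callable[[Action], None],
    observation: str,
    max_retries: int = 2,
) -> Optional[Action]:
    """Execute the first proposal the critic accepts, or nothing at all.

    `propose` receives the actions rejected so far so it can avoid them.
    Returns the executed action, or None if every proposal was rejected.
    """
    rejected: List[Action] = []
    for _ in range(max_retries + 1):
        action = propose(observation, rejected)
        if critic(observation, action):  # verdict comes before execution
            execute(action)
            return action
        rejected.append(action)
    return None


# Toy stand-ins (assumptions for the demo, not real models).
def toy_propose(observation: str, rejected: List[Action]) -> Action:
    candidates = [Action("click", "Delete all"), Action("click", "Save")]
    for candidate in candidates:
        if candidate not in rejected:
            return candidate
    return candidates[-1]


def toy_critic(observation: str, action: Action) -> bool:
    # Reject anything that looks destructive.
    return "Delete" not in action.target


executed: List[Action] = []
result = run_step(toy_propose, toy_critic, executed.append, "settings screen")
```

With these toy stand-ins, the first proposal is rejected and the second is executed. The point of the structure is that no action reaches `execute` without an accepting verdict, which matters most when actions are irreversible.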

Abstract

With VLM-powered computer-using agents (CUAs) becoming increasingly capable at graphical user interface (GUI) navigation and manipulation, reliable step-level decision-making has emerged as a key bottleneck for real-world deployment. In long-horizon workflows, errors accumulate quickly and irreversible actions can cause unintended consequences, motivating critic models that assess each action before execution. While critic models offer a promising solution, their effectiveness is hindered by the lack of diverse, high-quality GUI feedback data and of public critic benchmarks for step-level evaluation in computer use. To bridge these gaps, we introduce OS-Oracle, which makes three core contributions: (1) a scalable data pipeline for synthesizing cross-platform GUI critic data; (2) a two-stage training paradigm combining supervised fine-tuning (SFT) and consistency-preserving group relative…

Taxonomy

Topics: Advanced Software Engineering Methodologies · Artificial Intelligence in Games · Adversarial Robustness in Machine Learning