UI-Venus Technical Report: Building High-performance UI Agents with RFT

Zhangxuan Gu; Zhengwen Zeng; Zhenyu Xu; Xingran Zhou; Shuheng Shen; Yunfei Liu; Beitong Zhou; Changhua Meng; Tianyu Xia; Weizhi Chen; Yue Wen; Jingya Dou; Fei Tang; Jinzhen Lin; Yulin Liu; Zhenlin Guo; Yichen Gong; Heng Jia; Changlong Gao; Yuan Guo; Yong Deng; Zhenyu Guo; Liang Chen; Weiqiang Wang

arXiv:2508.10833 · cs.CV · August 18, 2025

TL;DR

UI-Venus is a high-performance UI agent using multimodal large language models and reinforcement fine-tuning, achieving state-of-the-art results in UI grounding and navigation tasks with only screenshots as input.

Contribution

The paper introduces UI-Venus, a novel UI agent leveraging RFT and a multimodal LLM, with new techniques for data cleaning and trajectory refinement to enhance performance.

Findings

01

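UI-Venus achieves SOTA on the ScreenSpot-V2 / ScreenSpot-Pro grounding benchmarks: 94.1% / 50.8% for the 7B variant and 95.3% / 61.9% for the 72B variant.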

02

On the AndroidWorld navigation benchmark, the 7B and 72B variants reach success rates of 49.1% and 65.9%, outperforming existing models.

03

Proposed methods improve reasoning and planning in complex UI tasks.

Abstract

We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., ScreenSpot-V2 / ScreenSpot-Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning abilities, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve 49.1% and 65.9% success rates, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation…
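For readers unfamiliar with how grounding percentages like those above are typically derived: ScreenSpot-style benchmarks commonly count a prediction as correct when the predicted click point falls inside the target element's bounding box. The sketch below illustrates that scoring rule only. The function names, the pixel-coordinate convention, and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for illustration, and this is not the paper's evaluation code or its reward function.

```python
from typing import Sequence, Tuple

# Assumed conventions (hypothetical, for illustration only):
# a point is (x, y) and a box is (x_min, y_min, x_max, y_max), both in pixels.
Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box, edges included."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return hits / len(preds)


if __name__ == "__main__":
    # Two toy predictions against the same 10x10 target element.
    predictions = [(5.0, 5.0), (50.0, 50.0)]
    targets = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(f"accuracy = {grounding_accuracy(predictions, targets):.1%}")
```

Under this rule, a reported score such as 94.1% would mean that roughly 94 of every 100 predicted points landed inside the annotated element.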
