UI-Venus Technical Report: Building High-performance UI Agents with RFT
Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, Yue Wen, Jingya Dou, Fei Tang, Jinzhen Lin, Yulin Liu, Zhenlin Guo, Yichen Gong, Heng Jia, Changlong Gao, Yuan Guo, Yong Deng, Zhenyu Guo

TL;DR
UI-Venus is a high-performance UI agent using multimodal large language models and reinforcement fine-tuning, achieving state-of-the-art results in UI grounding and navigation tasks with only screenshots as input.
Contribution
The paper introduces UI-Venus, a novel UI agent leveraging RFT and a multimodal LLM, with new techniques for data cleaning and trajectory refinement to enhance performance.
Findings
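UI-Venus achieves SOTA on the ScreenSpot-V2 / ScreenSpot-Pro grounding benchmarks: 94.1% / 50.8% for the 7B variant and 95.3% / 61.9% for the 72B variant.
On AndroidWorld, an online UI navigation arena, the 7B and 72B variants reach success rates of 49.1% and 65.9%, outperforming existing models.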
Proposed methods improve reasoning and planning in complex UI tasks.
Abstract
We present UI-Venus, a native UI agent built on a multimodal large language model that takes only screenshots as input. UI-Venus achieves SOTA performance on both UI grounding and navigation tasks using only several hundred thousand high-quality training samples through reinforcement fine-tuning (RFT) based on Qwen2.5-VL. Specifically, the 7B and 72B variants of UI-Venus obtain 94.1% / 50.8% and 95.3% / 61.9% on the standard grounding benchmarks, i.e., ScreenSpot-V2 / ScreenSpot-Pro, surpassing the previous SOTA baselines including open-source GTA1 and closed-source UI-TARS-1.5. To show UI-Venus's summarization and planning ability, we also evaluate it on AndroidWorld, an online UI navigation arena, on which our 7B and 72B variants achieve success rates of 49.1% and 65.9%, also beating existing models. To achieve this, we introduce carefully designed reward functions for both UI grounding and navigation…
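To make the grounding numbers above concrete, here is a minimal sketch of how a ScreenSpot-style accuracy is commonly computed: a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box of the target element. This is an illustration under stated assumptions, not the paper's evaluation code or its reward function; the names `point_in_box` and `grounding_accuracy` and the `(x_min, y_min, x_max, y_max)` box layout are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def grounding_accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(predictions)


if __name__ == "__main__":
    # Hypothetical toy data: the first click hits its target, the second misses.
    preds = [(5.0, 5.0), (50.0, 50.0)]
    boxes = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
    print(grounding_accuracy(preds, boxes))
```

A binary hit-or-miss check of this kind is also a natural starting point for a grounding reward signal, though the reward design actually used by UI-Venus is not specified in the text above.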
Code & Models
- 🤗 inclusionAI/UI-Venus-1.5-2B (model): 2.7k downloads, 35 likes
- 🤗 inclusionAI/UI-Venus-Ground-7B (model): 275 downloads, 23 likes
- 🤗 inclusionAI/UI-Venus-Ground-72B (model): 27 downloads, 13 likes
- 🤗 inclusionAI/UI-Venus-Navi-7B (model): 17 downloads, 11 likes
- 🤗 inclusionAI/UI-Venus-Navi-72B (model): 15 downloads, 7 likes
- 🤗 inclusionAI/UI-Venus-1.5-8B (model): 4.2k downloads, 24 likes
- 🤗 inclusionAI/UI-Venus-1.5-30B-A3B (model): 3.4k downloads, 23 likes
- 🤗 mlx-community/UI-Venus-1.5-8B-bf16 (model): 13 downloads
- 🤗 mlx-community/UI-Venus-1.5-8B-6bit (model): 10 downloads
- 🤗 mlx-community/UI-Venus-1.5-8B-4bit (model): 21 downloads
