UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng

TL;DR
UI-TARS is an end-to-end native GUI agent that uses only screenshots for perception and interaction, outperforming existing models in GUI tasks through innovative perception, unified action modeling, and iterative training.
Contribution
The paper introduces UI-TARS, a novel GUI agent model that achieves state-of-the-art performance using a perception-driven, end-to-end approach with key innovations in perception, action modeling, and learning.
Findings
UI-TARS outperforms Claude on OSWorld and GPT-4o on AndroidWorld, and reports SOTA results on 10+ GUI agent benchmarks.
Enhanced perception improves UI understanding and captioning.
Iterative training enables continuous learning and adaptation.
Abstract
This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations). Unlike prevailing agent frameworks that depend on heavily wrapped commercial models (e.g., GPT-4o) with expert-crafted prompts and workflows, UI-TARS is an end-to-end model that outperforms these sophisticated frameworks. Experiments demonstrate its superior performance: UI-TARS achieves SOTA performance in 10+ GUI agent benchmarks evaluating perception, grounding, and GUI task execution. Notably, in the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9 respectively). In AndroidWorld, UI-TARS achieves 46.6, surpassing GPT-4o (34.5). UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of…
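The margins behind the reported scores can be worked out directly. The short Python sketch below uses only the numbers quoted in the abstract; the dictionary layout and the `margins` helper are illustrative names, not part of the paper or its code.

```python
# Scores quoted in the abstract: label -> (UI-TARS, baseline).
# The labels and this data layout are illustrative, not from the paper.
REPORTED = {
    "OSWorld (50 steps)": (24.6, 22.0),  # baseline: Claude
    "OSWorld (15 steps)": (22.7, 14.9),  # baseline: Claude
    "AndroidWorld": (46.6, 34.5),        # baseline: GPT-4o
}


def margins(reported):
    """Return {label: (absolute gain in points, relative gain in percent)}."""
    result = {}
    for label, (ui_tars, baseline) in reported.items():
        absolute = round(ui_tars - baseline, 1)
        relative = round(100 * (ui_tars - baseline) / baseline, 1)
        result[label] = (absolute, relative)
    return result


if __name__ == "__main__":
    for label, (absolute, relative) in margins(REPORTED).items():
        print(f"{label}: +{absolute} points ({relative}% over baseline)")
```

On these figures the gain is 2.6 points on OSWorld with 50 steps, 7.8 points with 15 steps, and 12.1 points on AndroidWorld, so the advantage over Claude is largest when the step budget is tight.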
Code & Models
- 🤗 ByteDance-Seed/UI-TARS-1.5-7B · model · 142k dl · ♡ 533
- 🤗 ByteDance-Seed/UI-TARS-2B-SFT · model · 1.1k dl · ♡ 35
- 🤗 ByteDance-Seed/UI-TARS-7B-SFT · model · 8.4k dl · ♡ 178
- 🤗 ByteDance-Seed/UI-TARS-72B-SFT · model · 50 dl · ♡ 24
- 🤗 ByteDance-Seed/UI-TARS-7B-DPO · model · 2.1k dl · ♡ 225
- 🤗 ByteDance-Seed/UI-TARS-72B-DPO · model · 561 dl · ♡ 151
- 🤗 lmstudio-community/UI-TARS-7B-DPO-GGUF · model · 421 dl · ♡ 9
- 🤗 lmstudio-community/UI-TARS-2B-SFT-GGUF · model · 253 dl · ♡ 3
- 🤗 lmstudio-community/UI-TARS-72B-DPO-GGUF · model · 63 dl · ♡ 3
- 🤗 pauljmorris/UI-TARS-7B-DPO · model
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Business Process Modeling and Analysis
