UFO: A UI-Focused Agent for Windows OS Interaction
Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Qi Zhang

TL;DR
UFO is a novel AI agent that uses GPT-Vision to understand and interact with Windows applications through GUIs, automating complex tasks via natural language commands with high effectiveness.
Contribution
UFO introduces the first UI agent tailored for Windows OS, employing a dual-agent framework for GUI analysis and control, enabling fully automated task execution.
Findings
Successfully tested across 9 Windows applications
Outperforms existing methods in task completion effectiveness
Enables natural language-driven automation of GUI tasks
Abstract
We introduce UFO, an innovative UI-Focused agent to fulfill user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision. UFO employs a dual-agent framework to meticulously observe and analyze the graphical user interface (GUI) and control information of Windows applications. This enables the agent to seamlessly navigate and operate within individual applications and across them to fulfill user requests, even when spanning multiple applications. The framework incorporates a control interaction module, facilitating action grounding without human intervention and enabling fully automated execution. Consequently, UFO transforms arduous and time-consuming processes into simple tasks achievable solely through natural language commands. We conducted testing of UFO across 9 popular Windows applications, encompassing a variety of scenarios reflective of users'…
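The dual-agent loop described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `Control`, `Action`, `run_request`, `pick_app`, and `pick_action` are hypothetical, the two selector callables stand in for GPT-Vision calls, and the "applications" are plain in-memory lists of controls rather than real Windows UI elements. For brevity the sketch selects one application per request, whereas the abstract also covers requests that span several.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Control:
    """A UI control exposed by an application (hypothetical structure)."""
    label: str  # annotation the model uses to refer to the control
    name: str   # human-readable name, e.g. "Save"
    kind: str   # control type, e.g. "Button"


@dataclass
class Action:
    """One step proposed by the action-selection agent (hypothetical structure)."""
    control_label: str
    operation: str
    argument: str = ""
    done: bool = False  # True when the agent considers the request fulfilled


# Stand-ins for the two model-backed agents; a real system would call a vision model.
AppSelector = Callable[[str, List[str]], str]
ActionSelector = Callable[[str, str, List[Control], List[Action]], Action]


def run_request(
    request: str,
    apps: Dict[str, List[Control]],
    select_app: AppSelector,
    select_action: ActionSelector,
    max_steps: int = 10,
) -> List[Tuple[str, str, str, str]]:
    """Pick an application, then repeatedly pick and ground actions on its controls."""
    app = select_app(request, list(apps))
    history: List[Action] = []
    log: List[Tuple[str, str, str, str]] = []
    for _ in range(max_steps):
        controls = apps[app]
        action = select_action(request, app, controls, history)
        if action.done:
            break
        # Grounding: map the model's chosen label back to a concrete control.
        target = next((c for c in controls if c.label == action.control_label), None)
        if target is None:
            raise LookupError(f"no control labelled {action.control_label!r} in {app}")
        # A real agent would operate the control here; this sketch only records the step.
        log.append((app, target.name, action.operation, action.argument))
        history.append(action)
    return log


# Scripted stand-ins so the sketch runs without any model or Windows dependency.
def pick_app(request: str, app_names: List[str]) -> str:
    return "Notepad" if "note" in request.lower() else app_names[0]


def pick_action(request: str, app: str, controls: List[Control], history: List[Action]) -> Action:
    if not history:
        return Action("1", "set_text", "hello")
    if len(history) == 1:
        return Action("2", "click")
    return Action("", "", done=True)


apps = {"Notepad": [Control("1", "Text Editor", "Edit"), Control("2", "Save", "Button")]}
log = run_request("Write a note saying hello", apps, pick_app, pick_action)
```

Here `log` records two grounded steps: setting the text of the editor control, then clicking the save control. Separating application selection from action selection mirrors the two roles the abstract attributes to the framework; the grounding step is where a label chosen by the model is tied to a concrete control without human intervention.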
Taxonomy
Topics: Mobile Agent-Based Network Management · Context-Aware Activity Recognition Systems · Distributed and Parallel Computing Systems
