TinyClick: Single-Turn Agent for Empowering GUI Automation

Pawel Pawlowski; Krystian Zawistowski; Wojciech Lapacz; Adam Wiacek; Marcin Skorupa; Sebastien Postansque; Jakub Hoscilowicz

arXiv:2410.11871 · cs.HC · May 22, 2025

TL;DR

TinyClick is a compact, efficient UI agent leveraging a vision-language model to accurately identify UI elements based on user commands, with minimal training resources and strong performance on benchmark datasets.

Contribution

Introduces TinyClick, a small UI agent built on Florence-2-Base, which uses vision-specific multi-task training and MLLM-based data augmentation to reduce resource needs and improve performance.

Findings

01 Achieves strong performance on Screenspot and OmniAct datasets.

02 Operates with only 0.27B parameters and minimal latency.

03 Requires only 56 GPU-hours for training.

Abstract

We present an agent for user interface (UI) interaction tasks, built on the vision-language model Florence-2-Base. The agent's primary task is to identify the screen coordinates of the UI element that corresponds to the user's command. It performs very strongly on Screenspot and OmniAct while remaining very small, at 0.27B parameters, with minimal latency. Moreover, training needs only a small compute budget of 56 GPU-hours (worth about 40 USD). Notable improvement comes from vision-specific multi-task training and MLLM-based data augmentation. We hope that a reduced need for expensive compute resources and manually annotated data will make research on UI agents more inclusive and sustainable.
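
To make the agent's primary task concrete, the following is a minimal sketch of how a model's text output might be turned into a click position in pixels. It assumes, as an illustration only, that the model emits quantized location tokens of the form `<loc_N>` with N between 0 and 999, either as one (x, y) pair or as a bounding box (x1, y1, x2, y2); the abstract does not specify the output format. The function name `decode_click` and the constant `NUM_BINS` are hypothetical.

```python
import re

# Assumption: coordinates are quantized into 1000 bins relative to image size.
NUM_BINS = 1000


def decode_click(text: str, width: int, height: int) -> tuple[int, int]:
    """Convert <loc_N> tokens in model output into a pixel click position.

    Two tokens are read as a point (x, y); four tokens are read as a
    bounding box (x1, y1, x2, y2), whose center is used as the click point.
    """
    bins = [int(b) for b in re.findall(r"<loc_(\d+)>", text)]
    if len(bins) == 2:
        x_bin, y_bin = bins
    elif len(bins) == 4:
        # Click the center of the predicted bounding box.
        x_bin = (bins[0] + bins[2]) / 2
        y_bin = (bins[1] + bins[3]) / 2
    else:
        raise ValueError(f"expected 2 or 4 location tokens, got {len(bins)}")
    # Scale from bin space back to pixel space.
    x = round(x_bin / NUM_BINS * width)
    y = round(y_bin / NUM_BINS * height)
    return x, y


if __name__ == "__main__":
    # Hypothetical model output for the command "open settings".
    print(decode_click("click <loc_500><loc_250>", 1920, 1080))  # (960, 270)
```

Because the tokens are relative to the image size, the same output maps to the right pixel on screenshots of any resolution, as long as the screenshot dimensions are passed in at decode time.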

Code & Models

Models

Hugging Face model: kzawistowsk/TinyClick

Taxonomy

Topics: Mobile and Web Applications · Gaze Tracking and Assistive Technology · IoT-based Smart Home Systems