UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han; Zhixiong Zeng; Jing Huang; Shu Jiang; Liming Zheng; Longrong Yang; Haibo Qiu; Chang Yao; Jingyuan Chen; Lin Ma

arXiv:2506.11127 · cs.CL · November 27, 2025

TL;DR

UITron-Speech introduces an end-to-end speech-based GUI agent that processes speech instructions and screenshots, enhancing accessibility and interaction without relying on text inputs, supported by synthesized datasets and a novel training strategy.

Contribution

The paper presents what the authors describe as the first end-to-end GUI agent that processes speech instructions directly. It relies on synthesized speech instruction datasets and a mixed-modality training strategy to improve performance and accessibility.

Findings

01 Achieves robust performance across multiple benchmarks.

02 Demonstrates superior adaptability to speech-driven interactions.

03 Validates the effectiveness of synthesized datasets and training strategies.

Abstract

Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction…
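The abstract names a mixed-modality training strategy but does not say how it works. As a rough illustration only, the sketch below shows one plausible reading: each training example carries both a text instruction and a synthesized speech version, and the modality fed to the model is sampled per example. The `Example` fields, the `speech_ratio` parameter, and the function names are assumptions made for this sketch and are not taken from the paper or its repository.

```python
import random
from dataclasses import dataclass


@dataclass
class Example:
    # All field names are hypothetical, chosen for illustration.
    screenshot: str        # path to the GUI screenshot
    text_instruction: str  # original text instruction
    speech_path: str       # synthesized audio of the same instruction
    action: str            # target action the agent should predict


def build_mixed_batch(examples, speech_ratio=0.5, seed=0):
    """Pick speech or text as the instruction modality for each example.

    speech_ratio is an assumed knob: the probability that an example
    is presented with its speech instruction instead of its text one.
    """
    rng = random.Random(seed)
    batch = []
    for ex in examples:
        use_speech = rng.random() < speech_ratio
        batch.append({
            "screenshot": ex.screenshot,
            "modality": "speech" if use_speech else "text",
            "instruction": ex.speech_path if use_speech else ex.text_instruction,
            "target": ex.action,
        })
    return batch


examples = [
    Example("shot_0.png", "Open settings", "shot_0.wav", "click(120, 48)"),
    Example("shot_1.png", "Scroll down", "shot_1.wav", "scroll(down)"),
]
all_speech = build_mixed_batch(examples, speech_ratio=1.0)
all_text = build_mixed_batch(examples, speech_ratio=0.0)
```

Setting `speech_ratio` to 1.0 or 0.0 reduces this to single-modality training, which makes the ratio a convenient ablation knob in a sketch like this. The actual schedule used by UITron-Speech may differ.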

Code & Models

Repositories

guirobotron/guirobotron-speech (PyTorch, official)
