TL;DR
UITron-Speech is an end-to-end GUI agent that takes spoken instructions and screenshots as input and predicts user actions, removing the need for text instructions. It is trained on synthesized speech-instruction data with a mixed-modality training strategy.
Contribution
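The authors present UITron-Speech as the first end-to-end GUI agent to process speech instructions directly. They synthesize speech instruction datasets with a random-speaker text-to-speech model and use a mixed-modality training strategy to mitigate the modality imbalance of pre-trained foundation models.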
Findings
The agent is reported to perform robustly across multiple benchmarks.
It is reported to adapt well to speech-driven interaction.
The results are presented as validating the synthesized datasets and the training strategy.
Abstract
Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction…
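The data synthesis step mentioned in the abstract (turning text instructions into speech with a random-speaker text-to-speech model) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `SpeechSample` record, the `synthesize_dataset` function, the speaker IDs, and the `fake_tts` stand-in are all hypothetical, and a real TTS model would replace the stub.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class SpeechSample:
    """One synthesized example (hypothetical record layout)."""
    instruction_text: str
    speaker_id: str
    audio: bytes


def synthesize_dataset(
    instructions: Sequence[str],
    speaker_ids: Sequence[str],
    tts: Callable[[str, str], bytes],
    seed: int = 0,
) -> List[SpeechSample]:
    """Pair each text instruction with a randomly chosen speaker and synthesize audio.

    `tts` is a placeholder for any text-to-speech callable taking
    (text, speaker_id) and returning audio bytes.
    """
    rng = random.Random(seed)  # seeded so the speaker assignment is reproducible
    samples = []
    for text in instructions:
        speaker = rng.choice(speaker_ids)
        samples.append(SpeechSample(text, speaker, tts(text, speaker)))
    return samples


def fake_tts(text: str, speaker_id: str) -> bytes:
    """Stand-in for a real TTS model; returns tagged bytes instead of audio."""
    return f"{speaker_id}:{text}".encode("utf-8")


dataset = synthesize_dataset(
    ["Open the settings menu", "Tap the search bar"],
    ["spk_a", "spk_b", "spk_c"],
    fake_tts,
)
```

Sampling the speaker per instruction, rather than fixing one voice, is the part that corresponds to "random-speaker" in the abstract; everything else here is scaffolding.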
