Structuring GUI Elements through Vision Language Models: Towards Action Space Generation
Yi Xu, Yesheng Zhang, Jiajia Liu, Jingdong Chen

TL;DR
This paper proposes an IoU-Augmented Maximum Likelihood (IAML) training method that helps multimodal large language models (MLLMs) generate precise GUI element coordinates. It targets the lack of semantic meaning that numerical coordinates carry in language representation spaces, with the aim of improving GUI understanding.
Contribution
The paper introduces a novel IoU-based coordinate sampling pipeline and a fine-tuning paradigm that significantly improves MLLMs' accuracy in GUI element coordinate generation.
Findings
IAML outperforms traditional training methods in coordinate accuracy.
The data augmentation approach enhances model understanding of UI element positions.
Experimental results show improved performance in GUI element structuring tasks.
Abstract
Multimodal large language models (MLLMs) have emerged as pivotal tools in enhancing human-computer interaction. In this paper we focus on the application of MLLMs in the field of graphical user interface (GUI) elements structuring, where they assist in processing user instructions based on screen contents. Despite the promise of MLLMs, their performance in precisely generating UI element coordinates, a critical aspect of GUI understanding, is hindered by the nature of next-token prediction training. This challenge arises from the semantic void surrounding numerical UI coordinates in language representation spaces, necessitating a substantial and diverse dataset to bolster visual module capabilities. To address these limitations, we introduce an IoU-Augmented Maximum Likelihood (IAML) training paradigm. Specifically, our approach involves a novel pipeline for IoU-based coordinate…
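To make the term concrete: intersection over union (IoU) measures how much two bounding boxes overlap, from 0 (disjoint) to 1 (identical). The sketch below computes IoU and then draws perturbed boxes around a ground-truth box, keeping only those above an IoU threshold. It is an illustration of the general idea only, not the authors' pipeline, whose details are not given in the text above. The function `sample_boxes` and its parameters (`max_shift`, `min_iou`, `max_tries`) are hypothetical names chosen for this example.

```python
import random


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def sample_boxes(gt, n=5, max_shift=0.2, min_iou=0.5, seed=0, max_tries=1000):
    """Hypothetical sampler: jitter each edge of `gt` by up to `max_shift`
    of the box width/height and keep candidates whose IoU with `gt`
    is at least `min_iou`. Returns a list of (box, iou) pairs."""
    rng = random.Random(seed)
    w, h = gt[2] - gt[0], gt[3] - gt[1]
    kept = []
    for _ in range(max_tries):
        if len(kept) == n:
            break
        # Even indices are x coordinates (scaled by width), odd are y (height).
        cand = tuple(
            c + rng.uniform(-max_shift, max_shift) * (w if i % 2 == 0 else h)
            for i, c in enumerate(gt)
        )
        if cand[2] <= cand[0] or cand[3] <= cand[1]:
            continue  # skip degenerate boxes
        score = iou(gt, cand)
        if score >= min_iou:
            kept.append((cand, score))
    return kept


if __name__ == "__main__":
    button = (100.0, 200.0, 180.0, 240.0)  # made-up UI element box
    for box, score in sample_boxes(button):
        print([round(v, 1) for v in box], round(score, 3))
```

Each kept candidate comes with its IoU score, which a training scheme could in principle use as a weight or reward for near-miss coordinates. How IAML actually uses such samples is not specified in the truncated abstract.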
