EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data

Xuetian Chen; Hangcheng Li; Jiaqing Liang; Sihang Jiang; Deqing Yang

arXiv:2410.19461·cs.AI·November 5, 2024

TL;DR

This paper introduces EDGE, a data synthesis framework that automatically generates large-scale, multi-granularity training data from web pages to enhance GUI understanding in vision-language models, reducing manual annotation needs.

Contribution

EDGE provides a novel, automated data generation method from web pages, significantly improving GUI understanding in LVLMs without extensive manual labeling.

Findings

01. Models trained with EDGE data outperform baselines on GUI benchmarks.

02. The approach generalizes well to unseen desktop and mobile environments.

03. Reduces reliance on manual annotation for GUI understanding tasks.

Abstract

Autonomous agents operating on the graphical user interfaces (GUIs) of various applications hold immense practical value. Unlike the large language model (LLM)-based methods which rely on structured texts and customized backends, the approaches using large vision-language models (LVLMs) are more intuitive and adaptable as they can visually perceive and directly interact with screens, making them indispensable in general scenarios without text metadata and tailored backends. Given the lack of high-quality training data for GUI-related tasks in existing work, this paper aims to enhance the GUI understanding and interacting capabilities of LVLMs through a data-driven approach. We propose EDGE, a general data synthesis framework that automatically generates large-scale, multi-granularity training data from webpages across the Web. Evaluation results on various GUI and agent benchmarks…

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Persona Design and Applications