OmniParser for Pure Vision Based GUI Agent
Yadong Lu, Jianwei Yang, Yelong Shen, Ahmed Awadallah

TL;DR
OmniParser is a screen parsing technique that helps vision-language models such as GPT-4V understand and interact with user interfaces. It detects interactable icons and describes the semantics of screen elements, which improves performance on UI benchmarks.
Contribution
We introduce OmniParser, a comprehensive method combining icon detection and semantic understanding to improve GPT-4V's ability to interpret and act on user interface screenshots.
Findings
OmniParser significantly improves GPT-4V's performance on the ScreenSpot benchmark.
It outperforms GPT-4V baselines on Mind2Web and AITW benchmarks using only screenshots.
The method includes curated datasets for icon detection and semantic captioning.
Abstract
The recent success of large vision language models shows great potential in driving the agent system operating on user interfaces. However, we argue that the power of multimodal models like GPT-4V as a general agent on multiple operating systems across different applications is largely underestimated due to the lack of a robust screen parsing technique capable of: 1) reliably identifying interactable icons within the user interface, and 2) understanding the semantics of various elements in a screenshot and accurately associating the intended action with the corresponding region on the screen. To fill these gaps, we introduce OmniParser, a comprehensive method for parsing user interface screenshots into structured elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. We first…
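To make the idea of "parsing a screenshot into structured elements" concrete, here is a minimal sketch of what such a representation could look like and how a model's chosen element could be mapped back to a screen region. It is an illustration under stated assumptions, not the authors' implementation: the `ParsedElement` schema, the helper names `elements_to_prompt` and `click_point`, and the example data are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ParsedElement:
    """One detected UI element (hypothetical schema, not taken from the paper)."""
    element_id: int
    bbox: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels
    description: str                         # functional caption of the element
    interactable: bool


def elements_to_prompt(elements: List[ParsedElement]) -> str:
    """List the interactable elements as text that can accompany the screenshot."""
    return "\n".join(
        f"[{e.element_id}] {e.description}" for e in elements if e.interactable
    )


def click_point(elements: List[ParsedElement], element_id: int) -> Tuple[float, float]:
    """Ground a chosen element ID to the center of its bounding box."""
    for e in elements:
        if e.element_id == element_id:
            x0, y0, x1, y1 = e.bbox
            return ((x0 + x1) / 2, (y0 + y1) / 2)
    raise KeyError(f"no element with id {element_id}")


# Made-up example data, for illustration only.
example = [
    ParsedElement(0, (0, 0, 200, 40), "page title text", False),
    ParsedElement(1, (10, 10, 110, 30), "search button", True),
    ParsedElement(2, (300, 10, 340, 50), "settings gear icon", True),
]
```

With this layout, the model only has to name an element ID; converting that ID into pixel coordinates is done deterministically outside the model, for instance via `click_point(example, 1)`.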
Peer Reviews
Decision·Submitted to ICLR 2025
- The paper proposes an alternative method for parsing GUI elements in screenshots by using an icon detection model combined with a description model to generate comprehensive information about the elements.
- The collected dataset on icon detection and description is expected to be highly beneficial for the community.
- The experiments are rich and thorough, and the benchmark performance shows promising results.
1. Since the paper mainly uses existing YOLO-v8 and BLIP v2 models, its primary contribution lies in the proposed icon-detection and description datasets. However, many details regarding dataset construction and data statistics are missing.
   - The authors mention that the main focus is on collecting an "INTERACTABLE REGION DETECTION" dataset. However, how are elements deemed interactive? The DOM does not directly provide metadata indicating interactivity, only element types. How was this impl
1. The authors conducted comprehensive evaluations on multiple benchmark datasets, with OmniParser achieving state-of-the-art results across all datasets.
2. OmniParser does not rely on additional information like DOM or view hierarchy, making it more generalizable.
3. The authors performed extensive ablation studies validating the effectiveness of the proposed ID and IS modules, and verified the robustness of the entire framework on open-source Llama-3.2-V and Phi-3.5-V models.
4. The author
1. OmniParser's strong performance relies heavily on the powerful backbone model (GPT-4V), and switching to open-source models would significantly decrease its performance.
2. The entire pipeline is not end-to-end, which increases its complexity and inference latency.
1. This paper is well written and easy to follow. The demonstration of the GUI framework is clear.
2. The proposed strategy, which involves extracting the position of interactable elements along with their function descriptions, is both flexible and effective. It significantly enhances GPT-4V’s agentic capabilities, particularly the GUI grounding capabilities of the entire framework.
3. The data curation considers the position and function description of interactable elements, which would be ben
1. The novelty and innovation in this work are limited. The primary approach involves training existing modules for various purposes, such as icon detection and captioning, and then integrating these modules to construct a GUI framework. This does not meet the criteria for ICLR.
2. Missing details.
   - 2.1 The authors collected data for interactable region detection from web pages. However, in Figure 2, most examples contain text elements, with very few icons shown. If most of the intera
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
