Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces

El Hassane Ettifouri; Jessica López Espejel; Laura Minkova; Tassnim Dardouri; Walid Dahhane

arXiv:2407.01558·cs.HC·July 21, 2025

TL;DR

This paper introduces visual grounding methods for GUIs that use multimodal AI models to locate the on-screen element a natural language instruction refers to, supporting automation and accessibility tasks.

Contribution

It presents two novel approaches, IVGocr and IVGdirect, for object identification in GUIs, along with datasets and a new evaluation metric, CPV.
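
This page names the CPV metric but does not define it. The sketch below assumes one plausible reading: a prediction counts as correct when the center point of the predicted box lies inside the ground-truth box. That reading, and every name in the code (`Box`, `center_hit`, `center_hit_rate`), is an assumption made for illustration, not the authors' definition.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def center_hit(pred: Box, truth: Box) -> bool:
    """Return True if the center of `pred` lies inside `truth` (edges included).

    Assumed reading of a center-point check; not taken from the paper.
    """
    cx = (pred.x_min + pred.x_max) / 2
    cy = (pred.y_min + pred.y_max) / 2
    return truth.x_min <= cx <= truth.x_max and truth.y_min <= cy <= truth.y_max


def center_hit_rate(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Fraction of predictions whose center falls inside the paired ground truth."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(center_hit(p, t) for p, t in zip(preds, truths))
    return hits / len(preds)
```

Under this reading, a loose predicted box still counts as correct as long as a click at its center would land on the intended element, which is the property that matters when the output drives a mouse click.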

Findings

01. IVGocr combines an LLM, an object detection model, and OCR for GUI grounding.

02. IVGdirect offers an end-to-end multimodal grounding architecture.

03. The proposed methods and datasets facilitate future GUI interaction research.

Abstract

Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer vision-powered artificial intelligence (AI) agents for automatic application interaction. Enabling AI to effectively understand and interact with GUIs is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multi-modal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an…
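
The IVG task described above takes an instruction and a screen and returns the coordinates to act on. The following is a minimal sketch of how a three-stage pipeline of the IVGocr kind could be wired together, assuming the LLM extracts the target label, the detector proposes element boxes, and OCR reads the text in each box. The stage order, the `Grounder` class, its three callables, and the toy `fake_screen` are all hypothetical stand-ins; the truncated abstract does not specify how the authors connect the components.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Grounder:
    """Hypothetical three-stage grounding pipeline with pluggable stages."""
    extract_target: Callable[[str], str]                 # stand-in for the LLM
    detect_elements: Callable[[object], Sequence[Box]]   # stand-in for the detector
    read_text: Callable[[object, Box], str]              # stand-in for OCR

    def locate(self, instruction: str, screen: object) -> Optional[Tuple[float, float]]:
        """Return the center of the first element whose text matches the target."""
        target = self.extract_target(instruction).strip().lower()
        for box in self.detect_elements(screen):
            if self.read_text(screen, box).strip().lower() == target:
                x0, y0, x1, y1 = box
                return ((x0 + x1) / 2, (y0 + y1) / 2)
        return None  # no element matched the instruction


# Toy stand-ins: the "screen" is a dict from box to visible label.
fake_screen = {(0, 0, 100, 40): "File", (100, 0, 200, 40): "Save"}
grounder = Grounder(
    extract_target=lambda text: text.rsplit(" ", 1)[-1],  # last word as the target
    detect_elements=lambda screen: list(screen.keys()),
    read_text=lambda screen, box: screen[box],
)
print(grounder.locate("Click on Save", fake_screen))
```

Passing the stages in as callables keeps the sketch independent of any particular model or OCR library; real components would replace the lambdas without changing `locate`.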

Taxonomy

Topics: Robotics and Automated Systems · Advanced Data Processing Techniques