Visual Grounding Methods for Efficient Interaction with Desktop Graphical User Interfaces
El Hassane Ettifouri, Jessica López Espejel, Laura Minkova, Tassnim Dardouri, Walid Dahhane

TL;DR
This paper introduces methods for instruction-based visual grounding in GUIs using multimodal AI models: given a natural language instruction and a GUI screen, they locate the on-screen element where the instruction should be executed, supporting automation and accessibility tasks.
Contribution
It presents two novel approaches, IVGocr and IVGdirect, for object identification in GUIs, along with datasets and a new evaluation metric, CPV.
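This summary names the CPV metric without defining it. As a rough illustration only, the sketch below assumes a center-point style check: a prediction counts as correct when the center of the predicted bounding box falls inside the ground-truth box. The function names, the box layout `(x_min, y_min, x_max, y_max)`, and the scoring rule are all assumptions made for this example, not the paper's definition.

```python
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Return the center point of a box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def center_point_hit(pred: Box, truth: Box) -> bool:
    """True if the center of the predicted box lies inside the ground-truth box."""
    cx, cy = center(pred)
    x_min, y_min, x_max, y_max = truth
    return x_min <= cx <= x_max and y_min <= cy <= y_max


def center_point_score(preds: Sequence[Box], truths: Sequence[Box]) -> float:
    """Fraction of predictions whose center falls inside the matching truth box."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    hits = sum(center_point_hit(p, t) for p, t in zip(preds, truths))
    return hits / len(preds)
```

A check of this kind suits GUI interaction better than a strict overlap threshold, since a click succeeds as long as it lands on the target element.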
Findings
IVGocr combines a Large Language Model (LLM), an object detection model, and OCR for GUI grounding.
IVGdirect offers an end-to-end multimodal grounding architecture.
The proposed methods and datasets facilitate future GUI interaction research.
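To make the IVGocr idea concrete, here is a minimal sketch of how an LLM, an object detector, and OCR might be chained. It is not the authors' implementation: the `Element` type, the `ground` function, the `min_score` threshold, and the fuzzy string matching are hypothetical, and the LLM step is replaced by a toy function that simply pulls quoted text out of the instruction. The detector and OCR outputs are assumed to be already available as a list of boxes with recognized text.

```python
import re
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple


@dataclass
class Element:
    """A detected GUI element: bounding box plus the text OCR read inside it."""
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed layout
    text: str


def similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def toy_extract_target(instruction: str) -> str:
    """Stand-in for the LLM step: return the quoted label, else the whole instruction."""
    match = re.search(r'"([^"]+)"', instruction)
    return match.group(1) if match else instruction


def ground(
    instruction: str,
    elements: List[Element],
    extract_target: Callable[[str], str],
    min_score: float = 0.6,  # hypothetical acceptance threshold
) -> Optional[Tuple[float, float]]:
    """Return the click point (box center) of the best-matching element, or None."""
    target = extract_target(instruction)
    best = max(elements, key=lambda e: similarity(target, e.text), default=None)
    if best is None or similarity(target, best.text) < min_score:
        return None
    x_min, y_min, x_max, y_max = best.box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


if __name__ == "__main__":
    screen = [Element((0, 0, 100, 40), "Cancel"), Element((120, 0, 220, 40), "Save")]
    print(ground('Click the "Save" button', screen, toy_extract_target))
```

In a real pipeline, `extract_target` would call an LLM and `screen` would come from a detector followed by OCR; passing the extractor in as a callable keeps that choice open.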
Abstract
Most visual grounding solutions primarily focus on realistic images. However, applications involving synthetic images, such as Graphical User Interfaces (GUIs), remain limited. This restricts the development of autonomous computer vision-powered artificial intelligence (AI) agents for automatic application interaction. Enabling AI to effectively understand and interact with GUIs is crucial to advancing automation in software testing, accessibility, and human-computer interaction. In this work, we explore Instruction Visual Grounding (IVG), a multi-modal approach to object identification within a GUI. More precisely, given a natural language instruction and a GUI screen, IVG locates the coordinates of the element on the screen where the instruction should be executed. We propose two main methods: (1) IVGocr, which combines a Large Language Model (LLM), an object detection model, and an…
Taxonomy
Topics: Robotics and Automated Systems · Advanced Data Processing Techniques
Methods: Focus
