Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning

Hai-Ming Xu; Qi Chen; Lei Wang; Lingqiao Liu

arXiv:2412.10840·cs.CV·December 17, 2024

TL;DR

This paper introduces a tuning-free method called TAG that uses attention patterns in pretrained multimodal large language models to accurately identify GUI components without additional training, achieving performance comparable to fine-tuned models.

Contribution

The paper presents a novel attention-driven grounding approach that leverages pretrained MLLMs' inherent attention, eliminating the need for fine-tuning in GUI component localization.

Findings

01. Attention maps effectively localize GUI components.

02. TAG achieves comparable performance to fine-tuned models.

03. Attention-based grounding outperforms direct localization predictions.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a…
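To make the idea of "aggregating attention maps from specific tokens" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names `aggregate_attention` and `ground_point` are hypothetical, and the aggregation rule (a plain mean over the selected tokens) and the readout (center of the highest-scoring patch) are assumptions chosen for illustration. It also assumes the per-token attention over image patches has already been extracted from the model as row-major lists.

```python
from typing import List, Tuple


def aggregate_attention(token_maps: List[List[float]]) -> List[float]:
    """Average attention over image patches across the selected tokens.

    Assumption: a plain mean; the paper's actual aggregation may differ.
    """
    if not token_maps:
        raise ValueError("need at least one token attention map")
    n_patches = len(token_maps[0])
    if any(len(m) != n_patches for m in token_maps):
        raise ValueError("all attention maps must cover the same patches")
    n_tokens = len(token_maps)
    return [sum(m[i] for m in token_maps) / n_tokens for i in range(n_patches)]


def ground_point(
    token_maps: List[List[float]], grid_w: int, grid_h: int
) -> Tuple[float, float]:
    """Return the normalized (x, y) center of the most-attended patch.

    Assumption: patches are laid out row-major on a grid_w x grid_h grid.
    """
    agg = aggregate_attention(token_maps)
    if len(agg) != grid_w * grid_h:
        raise ValueError("attention map size does not match the patch grid")
    best = max(range(len(agg)), key=agg.__getitem__)
    row, col = divmod(best, grid_w)
    return ((col + 0.5) / grid_w, (row + 0.5) / grid_h)


# Toy input: two selected query tokens over a 3x2 patch grid (row-major).
maps = [
    [0.0, 0.1, 0.6, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.2, 0.1, 0.1],
]
point = ground_point(maps, grid_w=3, grid_h=2)
print(point)  # normalized coordinates of the top-right patch center
```

In this toy input both tokens attend most strongly to the third patch, so the predicted point falls in the top-right cell of the grid. A real pipeline would also need to choose which tokens, layers, and heads to read attention from, which this sketch leaves out.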

Code & Models

Repositories

heimingx/tag (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems

Methods: Softmax · Attention Is All You Need