Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
Hai-Ming Xu, Qi Chen, Lei Wang, Lingqiao Liu

TL;DR
This paper introduces a tuning-free method called TAG that uses attention patterns in pretrained multimodal large language models to accurately identify GUI components without additional training, achieving performance comparable to fine-tuned models.
Contribution
The paper presents a novel attention-driven grounding approach that leverages pretrained MLLMs' inherent attention, eliminating the need for fine-tuning in GUI component localization.
Findings
Attention maps effectively localize GUI components.
TAG achieves comparable performance to fine-tuned models.
Attention-based grounding outperforms direct localization predictions.
Abstract
Recent advancements in Multimodal Large Language Models (MLLMs) have generated significant interest in their ability to autonomously interact with and interpret Graphical User Interfaces (GUIs). A major challenge in these systems is grounding: accurately identifying critical GUI components such as text or icons based on a GUI image and a corresponding text query. Traditionally, this task has relied on fine-tuning MLLMs with specialized training data to predict component locations directly. However, in this paper, we propose a novel Tuning-free Attention-driven Grounding (TAG) method that leverages the inherent attention patterns in pretrained MLLMs to accomplish this task without the need for additional fine-tuning. Our method involves identifying and aggregating attention maps from specific tokens within a carefully constructed query prompt. Applied to MiniCPM-Llama3-V 2.5, a…
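The abstract only says that attention maps from selected prompt tokens are identified and aggregated, so the following is a minimal sketch of that general idea rather than the paper's actual procedure. It assumes the attention weights from each selected token to the image patches have already been extracted from the model as plain lists. The function names, the mean aggregation, the `keep_ratio` threshold, and the weighted-centroid readout are all illustrative assumptions and do not come from the source.

```python
def aggregate_attention(attn_rows):
    """Average per-token attention over image patches (assumed aggregation rule)."""
    n = len(attn_rows)
    num_patches = len(attn_rows[0])
    return [sum(row[i] for row in attn_rows) / n for i in range(num_patches)]


def attention_to_point(attn_rows, grid_hw, image_wh, keep_ratio=0.5):
    """Map aggregated patch attention to one (x, y) pixel location.

    attn_rows : one list of patch weights per selected query token (hypothetical input)
    grid_hw   : (rows, cols) of the image patch grid
    image_wh  : (width, height) of the GUI screenshot in pixels
    keep_ratio: keep patches whose weight is at least this fraction of the peak
                (illustrative threshold, not taken from the paper)
    """
    grid_h, grid_w = grid_hw
    if not attn_rows or any(len(r) != grid_h * grid_w for r in attn_rows):
        raise ValueError("each attention row must have grid_h * grid_w entries")

    agg = aggregate_attention(attn_rows)
    peak = max(agg)
    if peak <= 0:
        raise ValueError("attention map has no positive mass")

    # Attention-weighted centroid of the high-attention patches, in patch units.
    total = cx = cy = 0.0
    for idx, weight in enumerate(agg):
        if weight >= keep_ratio * peak:
            row, col = divmod(idx, grid_w)
            total += weight
            cx += weight * (col + 0.5)
            cy += weight * (row + 0.5)

    # Rescale from patch-grid coordinates to pixel coordinates.
    width, height = image_wh
    return (cx / total / grid_w * width, cy / total / grid_h * height)
```

For example, with a 4x4 patch grid over a 400x400 screenshot, two query tokens that both attend only to the patch in row 1, column 2 yield a predicted point at the center of that patch.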
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
Methods: Softmax · Attention Is All You Need
