TL;DR
GUI-AIMA introduces an attention-based, coordinate-free fine-tuning method for GUI grounding that leverages the intrinsic multimodal attention of multimodal large language models (MLLMs), achieving high accuracy with limited data.
Contribution
It proposes a novel attention alignment framework for GUI grounding that is data-efficient and integrates seamlessly with existing models.
Findings
Achieves state-of-the-art accuracy on multiple GUI benchmarks.
Demonstrates high data efficiency with only 509k training samples.
Effectively triggers native grounding capabilities of MLLMs.
Abstract
Graphical user interface (GUI) grounding is a key capability for computer-use agents, mapping natural-language instructions to actionable regions on the screen. Existing Multimodal Large Language Model (MLLM) approaches typically formulate GUI grounding as a text-based coordinate generation task. However, directly generating precise coordinates from visual inputs is challenging and often data-intensive. A more intuitive strategy is to first identify instruction-relevant visual patches and then determine the exact click location within them. Motivated by recent observations that general MLLMs exhibit native grounding ability embedded in their attention maps, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals…
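The two-step idea in the abstract (find the instruction-relevant patches, then pick a click location within them) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `patch_attention_to_click`, the parameters `grid_w` and `patch_size`, and the choice of a simple argmax over patches are all assumptions made for illustration. The attention weights are assumed to be already extracted from a model and flattened in row-major order.

```python
def patch_attention_to_click(attn, grid_w, patch_size):
    """Map per-patch attention weights to a click point (x, y) in pixels.

    attn:       flat list of non-negative weights, one per visual patch,
                in row-major order (assumed to come from an MLLM).
    grid_w:     number of patches per row (hypothetical parameter).
    patch_size: side length of a square patch in pixels (hypothetical).
    """
    if grid_w <= 0 or not attn or len(attn) % grid_w != 0:
        raise ValueError("attn length must be a positive multiple of grid_w")

    # Step 1: pick the patch with the highest attention weight.
    best = max(range(len(attn)), key=attn.__getitem__)

    # Step 2: convert the flat index to (row, col) and click the patch center.
    row, col = divmod(best, grid_w)
    x = (col + 0.5) * patch_size
    y = (row + 0.5) * patch_size
    return x, y


# A 2x3 patch grid with 28-pixel patches; the peak is at row 1, column 1.
weights = [0.05, 0.05, 0.1,
           0.1,  0.6,  0.1]
print(patch_attention_to_click(weights, grid_w=3, patch_size=28))  # (42.0, 42.0)
```

No coordinate text is generated at any point: the click location is read directly off the patch grid. A real system would likely aggregate attention over several patches rather than take a single argmax.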
