GUI-AIMA: Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding

Shijie Zhou; Viet Dac Lai; Hao Tan; Jihyung Kil; Wanrong Zhu; Changyou Chen; Ruiyi Zhang

arXiv:2511.00810 · cs.CV · March 30, 2026

TL;DR

GUI-AIMA introduces an attention-based, coordinate-free fine-tuning method for GUI grounding that leverages the intrinsic multimodal attention of multimodal large language models (MLLMs), achieving high accuracy with limited training data.

Contribution

The paper proposes an attention-alignment framework for GUI grounding that is data-efficient and is applied to existing MLLMs through supervised fine-tuning.

Findings

01 Achieves state-of-the-art accuracy on multiple GUI benchmarks.

02 Demonstrates high data efficiency with only 509k training samples.

03 Effectively triggers native grounding capabilities of MLLMs.

Abstract

Graphical user interface (GUI) grounding is a key capability for computer-use agents, mapping natural-language instructions to actionable regions on the screen. Existing Multimodal Large Language Model (MLLM) approaches typically formulate GUI grounding as a text-based coordinate generation task. However, directly generating precise coordinates from visual inputs is challenging and often data-intensive. A more intuitive strategy is to first identify instruction-relevant visual patches and then determine the exact click location within them. Motivated by recent observations that general MLLMs exhibit native grounding ability embedded in their attention maps, we propose GUI-AIMA, an attention-based and coordinate-free supervised fine-tuning framework for efficient GUI grounding. GUI-AIMA aligns the intrinsic multimodal attention of MLLMs with patch-wise grounding signals. These signals…
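The two-step idea in the abstract (find instruction-relevant patches, then pick a click location inside them) can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated before it describes how the patch-wise signals are built, so the overlap-based labeling rule, the argmax click rule, the patch size of 14, and the function names `bbox_to_patch_labels` and `attention_to_click` are all assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]


def bbox_to_patch_labels(
    bbox: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
) -> Grid:
    """Hypothetical labeling rule: give equal weight to every patch that
    overlaps the target box, normalized so the grid sums to 1."""
    x0, y0, x1, y1 = bbox
    width, height = image_size
    cols, rows = width // patch_size, height // patch_size
    labels: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            px0, py0 = c * patch_size, r * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            # Strict inequalities: patches that only touch the box edge do not count.
            overlaps = px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0
            row.append(1.0 if overlaps else 0.0)
        labels.append(row)
    total = sum(sum(row) for row in labels)
    if total > 0:
        labels = [[v / total for v in row] for row in labels]
    return labels


def attention_to_click(attn: Grid, patch_size: int) -> Tuple[float, float]:
    """Hypothetical decoding rule: click the pixel center of the patch
    with the highest attention weight."""
    best_r, best_c = max(
        ((r, c) for r in range(len(attn)) for c in range(len(attn[0]))),
        key=lambda rc: attn[rc[0]][rc[1]],
    )
    return ((best_c + 0.5) * patch_size, (best_r + 0.5) * patch_size)


# A 56x56 screenshot split into a 4x4 grid of 14-pixel patches (assumed sizes).
labels = bbox_to_patch_labels((14, 14, 28, 28), (56, 56), 14)
click = attention_to_click(labels, 14)
```

Here the label grid stands in for a patch-level attention map; in a coordinate-free setup of this kind, the click point is read from the attention distribution rather than generated as text.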

Code & Models

Repositories

sjz5202/GUI-AIMA (GitHub)
