MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

SeokJoo Kwak; Jihoon Kim; Boyoun Kim; Jung Jae Yoon; Wooseok Jang; Jeonghoon Hong; Jaeho Yang; Yeong-Dae Kwon

arXiv:2511.13087 · cs.AI · November 18, 2025

TL;DR

MEGA-GUI introduces a multi-stage, modular framework for GUI element grounding that significantly improves accuracy over monolithic models by effectively handling visual clutter and semantic ambiguity.

Contribution

The paper proposes a multi-stage approach built from specialized agents, including a bidirectional ROI zoom algorithm and a context-aware rewriting agent, to improve GUI grounding accuracy.

Findings

01. Achieves 73.18% accuracy on the ScreenSpot-Pro benchmark.
02. Reaches 68.63% accuracy on the OSWorld-G benchmark.
03. Outperforms previous monolithic models on GUI grounding tasks.

Abstract

Graphical User Interface (GUI) grounding, the task of mapping natural language instructions to screen coordinates, is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy…
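
The coarse-to-fine split described in the abstract implies a coordinate bookkeeping step: a point predicted inside a cropped ROI has to be mapped back to full-screen pixels. The following is a minimal sketch of that step only, not the authors' implementation. The names `Box`, `zoom`, and `roi_to_screen`, the zoom-factor semantics, and the example numbers are all hypothetical; scaling an ROI about its center in either direction is just one plausible reading of "bidirectional" zoom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in screen pixels: left, top, right, bottom."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def zoom(roi: Box, factor: float, screen: Box) -> Box:
    """Scale roi about its center; factor < 1 zooms in, factor > 1 zooms out.

    Hypothetical helper: the result is clipped to the screen bounds.
    """
    cx = (roi.left + roi.right) / 2
    cy = (roi.top + roi.bottom) / 2
    half_w = roi.width * factor / 2
    half_h = roi.height * factor / 2
    return Box(
        max(screen.left, cx - half_w),
        max(screen.top, cy - half_h),
        min(screen.right, cx + half_w),
        min(screen.bottom, cy + half_h),
    )


def roi_to_screen(roi: Box, x_norm: float, y_norm: float) -> tuple[float, float]:
    """Map a point in normalized ROI coordinates (0..1) to screen pixels."""
    return (roi.left + x_norm * roi.width, roi.top + y_norm * roi.height)


# Assumed example values: a 4K screen and a fine-stage prediction inside the ROI.
screen = Box(0, 0, 3840, 2160)
roi = zoom(screen, 0.5, screen)          # zoom in to the central half
click = roi_to_screen(roi, 0.25, 0.5)    # point predicted within the crop
```

With these example numbers, `roi` is the box (960, 540, 2880, 1620) and `click` is (1440.0, 1080.0). Zooming back out by a large factor returns a box clipped to the screen, which is how a too-narrow ROI could be widened again under this sketch's assumptions.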

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling