MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements
SeokJoo Kwak, Jihoon Kim, Boyoun Kim, Jung Jae Yoon, Wooseok Jang, Jeonghoon Hong, Jaeho Yang, Yeong-Dae Kwon

TL;DR
MEGA-GUI introduces a multi-stage, modular framework for GUI element grounding that significantly improves accuracy over monolithic models by effectively handling visual clutter and semantic ambiguity.
Contribution
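The paper splits GUI grounding into two stages, coarse Region-of-Interest (ROI) selection and fine-grained element grounding, each handled by a specialized vision-language agent. It adds a bidirectional ROI zoom algorithm to mitigate spatial dilution and a context-aware rewriting agent to reduce semantic ambiguity.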
Findings
Achieves 73.18% accuracy on the ScreenSpot-Pro benchmark.
Reaches 68.63% accuracy on the OSWorld-G benchmark.
Outperforms previous monolithic models in GUI grounding tasks.
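To make the accuracy figures above concrete, here is a minimal sketch of how grounding accuracy is commonly scored: a prediction counts as correct when the predicted click point falls inside the target element's bounding box. This is an assumption about the scoring rule, not something this summary states, and the names `point_in_box` and `grounding_accuracy` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (edges inclusive)."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def grounding_accuracy(preds: List[Point], boxes: List[Box]) -> float:
    """Percentage of predicted points that land inside their target box.

    Assumed scoring rule: one predicted point and one ground-truth box
    per instruction.
    """
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    hits = sum(point_in_box(p, b) for p, b in zip(preds, boxes))
    return 100.0 * hits / len(preds)
```

Under this rule, a model that hits one of two targets scores 50.0.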
Abstract
Graphical User Interface (GUI) grounding - the task of mapping natural language instructions to screen coordinates - is essential for autonomous agents and accessibility technologies. Existing systems rely on monolithic models or one-shot pipelines that lack modularity and fail under visual clutter and ambiguous instructions. We introduce MEGA-GUI, a multi-stage framework that separates grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding, orchestrated by specialized vision-language agents. MEGA-GUI features a bidirectional ROI zoom algorithm that mitigates spatial dilution and a context-aware rewriting agent that reduces semantic ambiguity. Our analysis reveals complementary strengths and weaknesses across vision-language models at different visual scales, and we show that leveraging this modular structure achieves consistently higher accuracy…
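The abstract does not spell out the zoom algorithm, so the following is only an illustrative sketch of the two geometric pieces any coarse-to-fine pipeline of this kind needs: resizing an ROI around a point in either direction (zooming in when the region is too cluttered, zooming out when the target may have been cropped away), and mapping a point predicted inside the cropped ROI back to full-screen coordinates. The names `Box`, `zoom`, and `crop_to_screen`, and the scale factors used, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


def zoom(roi: Box, center: Tuple[float, float], factor: float, screen: Box) -> Box:
    """Rescale the ROI around `center`, keeping it inside the screen.

    factor < 1 zooms in (smaller ROI); factor > 1 zooms out (larger ROI).
    """
    new_w = min(roi.width * factor, screen.width)
    new_h = min(roi.height * factor, screen.height)
    cx, cy = center
    # Shift the window so it never extends past the screen edges.
    left = min(max(cx - new_w / 2, screen.left), screen.right - new_w)
    top = min(max(cy - new_h / 2, screen.top), screen.bottom - new_h)
    return Box(left, top, left + new_w, top + new_h)


def crop_to_screen(point: Tuple[float, float], roi: Box,
                   crop_size: Tuple[float, float]) -> Tuple[float, float]:
    """Map a point in the (possibly resized) ROI crop to screen coordinates."""
    px, py = point
    crop_w, crop_h = crop_size
    return (roi.left + px * roi.width / crop_w,
            roi.top + py * roi.height / crop_h)
```

In this sketch a coarse stage would call `zoom` with a factor below 1 to narrow the region, a fallback would call it with a factor above 1 to recover lost context, and the fine-grained stage's prediction would pass through `crop_to_screen` before being reported.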
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
