DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
Yichao Liu, Huawen Shen, Liu Yu, Shiyu Liu, Zeyu Chen, Yu Zhou

TL;DR
DRS-GUI is a training-free framework that improves GUI element grounding by dynamically exploring interface regions with human-like perceptual actions scheduled by Monte Carlo Tree Search, improving the grounding performance of existing multimodal large language models.
Contribution
It introduces a novel, training-free dynamic region search method with a perceptual action planner for better GUI grounding in multimodal models.
Findings
Achieves a 14% improvement on the ScreenSpot-Pro benchmark.
Significantly enhances grounding performance and generalization.
Integrates into existing MLLMs without additional training.
Abstract
GUI agents powered by Multimodal Large Language Models (MLLMs) have demonstrated impressive capability in understanding and executing user instructions. However, accurately grounding instruction-relevant elements from high-resolution screenshots cluttered with irrelevant UI components remains challenging for existing approaches. Inspired by how humans dynamically adjust their perceptual scope to locate task-related regions on complex screens, we propose DRS-GUI, a training-free dynamic region search framework for GUI grounding that can be seamlessly integrated into existing MLLMs. DRS-GUI introduces a lightweight UI Perceptor that performs three human-like perceptual actions (Focus, Shift, and Scatter) to progressively explore the interface and generate region proposals. To dynamically schedule these actions, we further design an Action Planner based on Monte Carlo Tree Search (MCTS). A…
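The abstract names three perceptual actions (Focus, Shift, Scatter) and an MCTS-based Action Planner but, in the truncated text available here, does not say how they are defined. The following is a minimal Python sketch of the general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the geometric definitions of `focus`, `shift`, and `scatter`, the UCB1 selection rule, the `min_size` stopping threshold, and the `point_score` reward, which stands in for whatever relevance signal an MLLM would provide.

```python
import math
from dataclasses import dataclass, field

# A region is (x, y, width, height) in screenshot pixels.
# All three action definitions below are assumed, not taken from the paper.

def focus(r, scale=0.5):
    """Zoom into the centre of r, keeping `scale` of each side."""
    x, y, w, h = r
    nw, nh = w * scale, h * scale
    return (x + (w - nw) / 2, y + (h - nh) / 2, nw, nh)

def shift(r, dx, dy, bounds):
    """Translate r by (dx, dy), clamped so it stays inside bounds."""
    x, y, w, h = r
    bx, by, bw, bh = bounds
    return (min(max(x + dx, bx), bx + bw - w),
            min(max(y + dy, by), by + bh - h), w, h)

def scatter(r):
    """Split r into its four quadrants."""
    x, y, w, h = r
    hw, hh = w / 2, h / 2
    return [(x, y, hw, hh), (x + hw, y, hw, hh),
            (x, y + hh, hw, hh), (x + hw, y + hh, hw, hh)]

def proposals(r, bounds):
    """Candidate regions reachable from r with one perceptual action."""
    _, _, w, h = r
    out = [focus(r)]
    for dx, dy in ((w / 2, 0), (-w / 2, 0), (0, h / 2), (0, -h / 2)):
        out.append(shift(r, dx, dy, bounds))
    out += scatter(r)
    return [p for p in out if p != r]  # drop no-op moves

@dataclass
class Node:
    region: tuple
    visits: int = 0
    value: float = 0.0
    children: list = field(default_factory=list)

def ucb1(child, parent_visits, c=1.4):
    """Standard UCB1 score; unvisited children are tried first."""
    if child.visits == 0:
        return math.inf
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)

def search(screen, score_fn, iterations=64, min_size=32.0):
    """Run MCTS over regions and return the best-scoring region seen."""
    root = Node(screen)
    best_score, best_region = -1.0, None
    for _ in range(iterations):
        # Selection: descend by UCB1 until reaching a leaf.
        node, path = root, [root]
        while node.children:
            parent = node
            node = max(parent.children, key=lambda ch: ucb1(ch, parent.visits))
            path.append(node)
        # Expansion: grow a visited leaf that is still large enough.
        if node.visits > 0 and min(node.region[2], node.region[3]) > min_size:
            node.children = [Node(p) for p in proposals(node.region, screen)]
            node = node.children[0]
            path.append(node)
        # Evaluation: score_fn is a placeholder for an MLLM relevance signal.
        reward = score_fn(node.region)
        if reward > best_score:
            best_score, best_region = reward, node.region
        # Backpropagation: update statistics along the visited path.
        for n in path:
            n.visits += 1
            n.value += reward
    return best_region

def point_score(target, screen):
    """Hypothetical reward: favours small regions that contain target."""
    tx, ty = target
    area = screen[2] * screen[3]
    def score(r):
        x, y, w, h = r
        inside = x <= tx <= x + w and y <= ty <= y + h
        return 1.0 - (w * h) / area if inside else 0.0
    return score
```

Calling `search((0, 0, 1920, 1080), point_score((900, 500), (0, 0, 1920, 1080)))` returns a sub-region of the screen that contains the target point. In a real pipeline the toy reward would be replaced by a score derived from the grounding model, and the selected region would be cropped and passed back to the MLLM for the final prediction.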
