TL;DR
This paper introduces GUI-SD, a novel on-policy self-distillation framework for GUI grounding that improves accuracy and efficiency by leveraging a visually enriched context and entropy-guided token weighting.
Contribution
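GUI-SD is the first on-policy self-distillation (OPSD) framework designed specifically for GUI grounding. It replaces the sparse, multi-rollout reward signal of prior reinforcement learning methods with dense token-level supervision from a single rollout, guided by a teacher that sees a visually enriched privileged context.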
Findings
GUI-SD outperforms GRPO-based methods in accuracy.
GUI-SD is also more training-efficient than GRPO-based methods.
GUI-SD achieves consistent improvements across six benchmarks.
Abstract
Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided…
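The two mechanisms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `sigma_scale` parameter, the choice to derive weights from the student's per-token entropy, and the direction of the weighting (higher entropy, higher weight) are all assumptions made for illustration. How the box and mask are combined with the screenshot is not specified here, so the sketch only computes the mask itself.

```python
import math


def gaussian_soft_mask(height, width, box, sigma_scale=0.5):
    """Build a height x width soft mask that peaks at the center of `box`.

    `box` is (x1, y1, x2, y2) in pixels. `sigma_scale` is a hypothetical
    knob that ties the Gaussian spread to the box size.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    # Spread proportional to the box size, floored at one pixel.
    sx = max((x2 - x1) * sigma_scale, 1.0)
    sy = max((y2 - y1) * sigma_scale, 1.0)
    return [
        [
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


def entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of one token's probability distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def entropy_weighted_kl(teacher, student, eps=1e-12):
    """Per-token KL(teacher || student), weighted by normalized entropy.

    `teacher` and `student` are lists of per-token distributions over the
    same vocabulary. Assumption: weights come from the student's entropy,
    so tokens the student is unsure about contribute more to the loss.
    """
    kls = [
        sum(t * (math.log(t + eps) - math.log(s + eps)) for t, s in zip(tp, sp))
        for tp, sp in zip(teacher, student)
    ]
    ents = [entropy(sp) for sp in student]
    total = sum(ents) + eps
    return sum((h / total) * kl for h, kl in zip(ents, kls))


if __name__ == "__main__":
    mask = gaussian_soft_mask(60, 100, (50, 20, 90, 40))
    print(mask[30][70])  # value at the box center
    uniform = [[0.5, 0.5], [0.5, 0.5]]
    print(entropy_weighted_kl([[0.9, 0.1], [0.5, 0.5]], uniform))
```

The soft mask highlights the target region without encoding its exact coordinates as text, which matches the stated goal of guiding the teacher without leaking the answer. The weighted loss shows one way a single rollout can yield a dense, per-token training signal.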
