TL;DR
LaSM is a layer-wise attention scaling method that enhances GUI agent robustness against pop-up attacks without retraining, by aligning attention with task-relevant regions.
Contribution
The paper uncovers a layer-wise attention divergence pattern between correct and incorrect outputs and introduces LaSM, a technique that amplifies attention and MLP modules in critical layers to improve defense success without additional training.
Findings
LaSM significantly increases the defense success rate against pop-up attacks.
It preserves the model's general capabilities, with negligible impact on them.
Attention misalignment is identified as a core vulnerability.
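One way to make "attention misalignment" concrete is to compare how much attention mass falls on task-relevant screen patches versus pop-up patches. The sketch below is only an illustration under stated assumptions: the function `region_attention_share`, the per-patch attention list, and the patch index sets are all hypothetical and are not taken from the paper, which may measure alignment differently.

```python
def region_attention_share(attn, region):
    """Fraction of the total attention mass that lands on the patches in `region`.

    `attn` is a hypothetical list of non-negative per-patch attention weights;
    `region` is a set of patch indices (also hypothetical).
    """
    total = sum(attn)
    return sum(attn[i] for i in region) / total


# Toy attention over five screen patches (made-up numbers for illustration).
attn = [0.05, 0.10, 0.05, 0.60, 0.20]
task_patches = {1, 4}   # assumed: patches covering the task-relevant element
popup_patches = {3}     # assumed: patches covering the injected pop-up

task_share = region_attention_share(attn, task_patches)
popup_share = region_attention_share(attn, popup_patches)

# In this toy case the pop-up draws more attention than the task region.
misaligned = popup_share > task_share
```

A defense that realigns attention would be expected to push `task_share` up relative to `popup_share` on attacked screens.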
Abstract
Graphical user interface (GUI) agents built on multimodal large language models (MLLMs) have recently demonstrated strong decision-making abilities in screen-based interaction tasks. However, they remain highly vulnerable to pop-up-based environmental injection attacks, where malicious visual elements divert model attention and lead to unsafe or incorrect actions. Existing defense methods either require costly retraining or perform poorly under inductive interference. In this work, we systematically study how such attacks alter the attention behavior of GUI agents and uncover a layer-wise attention divergence pattern between correct and incorrect outputs. Based on this insight, we propose LaSM, a Layer-wise Scaling Mechanism that selectively amplifies attention and MLP modules in critical layers. LaSM improves the alignment between model saliency and task-relevant…
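The idea of selectively amplifying attention and MLP modules in critical layers can be sketched on a toy transformer. This is a minimal illustration, not the authors' implementation: it assumes that "amplifying" means multiplying each module's output by a factor before it is added to the residual stream, and the names `toy_block`, `forward`, `critical_layers`, and `alpha`, the layer indices, and the value 1.2 are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add_scaled(h, update, scale):
    """Residual update: h + scale * update, element-wise."""
    return [[a + scale * b for a, b in zip(hr, ur)] for hr, ur in zip(h, update)]


def toy_block(h, params, scale=1.0):
    """One toy block; `scale` multiplies the attention and MLP outputs (assumption)."""
    wq, wk, wv, w1, w2 = params
    q, k, v = matmul(h, wq), matmul(h, wk), matmul(h, wv)
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(len(wq[0])) for s in row] for row in matmul(q, k_t)]
    h = add_scaled(h, matmul(softmax_rows(scores), v), scale)
    hidden = [[max(x, 0.0) for x in row] for row in matmul(h, w1)]  # ReLU MLP
    return add_scaled(h, matmul(hidden, w2), scale)


def forward(h, layers, critical_layers=(), alpha=1.0):
    """Run all blocks, amplifying only those whose index is in `critical_layers`."""
    for i, params in enumerate(layers):
        h = toy_block(h, params, alpha if i in critical_layers else 1.0)
    return h


def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, n_tokens, n_layers = 4, 3, 4
layers = [
    (rand_matrix(rng, d, d), rand_matrix(rng, d, d), rand_matrix(rng, d, d),
     rand_matrix(rng, d, 4 * d), rand_matrix(rng, 4 * d, d))
    for _ in range(n_layers)
]
x = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n_tokens)]

baseline = forward(x, layers)                                   # no scaling
scaled = forward(x, layers, critical_layers={1, 2}, alpha=1.2)  # hypothetical choice
```

Because the scaling is applied at inference time to existing module outputs, no weights are retrained, which matches the training-free framing of the abstract. How the critical layers and the scaling factor are actually selected is not specified in the text above.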
