Temporal UI State Inconsistency in Desktop GUI Agents: Formalizing and Defending Against TOCTOU Attacks on Computer-Use Agents
Wenpeng Xu

TL;DR
This paper formalizes UI state inconsistency vulnerabilities in desktop GUI agents, demonstrates three attack primitives that exploit them, and proposes a lightweight layered defense (PUSV) that re-verifies the UI state before each action is dispatched.
Contribution
It introduces a formal model of Visual Atomicity Violations and a layered pre-execution UI verification method (PUSV) for detecting TOCTOU attacks on desktop GUI agents.
Findings
PUSV achieves a 100% action-interception rate in adversarial trials.
The Primitive B attack (Window Focus Manipulation) achieves a 100% action-redirection success rate with zero visual evidence at observation time.
Different attack primitives require different detection signals, which supports a layered defense.
Abstract
GUI agents that control desktop computers via screenshot-and-click loops introduce a new class of vulnerability: the observation-to-action gap (mean 6.51 s on real OSWorld workloads) creates a time-of-check to time-of-use (TOCTOU) window during which an unprivileged attacker can manipulate the UI state. We formalize this as a Visual Atomicity Violation and characterize three concrete attack primitives: (A) Notification Overlay Hijack, (B) Window Focus Manipulation, and (C) Web DOM Injection. Primitive B, the closest desktop analog to Android Action Rebinding, achieves a 100% action-redirection success rate with zero visual evidence at observation time. We propose Pre-execution UI State Verification (PUSV), a lightweight three-layer defense that re-verifies the UI state immediately before each action dispatch: masked pixel SSIM at the click target (L1), global screenshot diff (L2a), and…
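To make the first two PUSV layers named in the abstract concrete, here is a minimal sketch of a pre-dispatch check, not the paper's implementation. It assumes grayscale screenshots stored as nested lists, treats the "mask" as a square patch around the click target, uses a single-window SSIM rather than a sliding-window one, and picks illustrative thresholds. The names `crop`, `ssim`, `changed_fraction`, and `verify_before_dispatch`, and the parameters `half`, `ssim_min`, `diff_max`, and `tol`, are all hypothetical.

```python
from statistics import fmean

Image = list[list[float]]  # grayscale pixels in [0, 255], indexed as img[y][x]


def crop(img: Image, cx: int, cy: int, half: int) -> Image:
    """Return the square patch centered on (cx, cy), clipped to the image."""
    h, w = len(img), len(img[0])
    y0, y1 = max(0, cy - half), min(h, cy + half + 1)
    x0, x1 = max(0, cx - half), min(w, cx + half + 1)
    return [row[x0:x1] for row in img[y0:y1]]


def ssim(a: Image, b: Image, dynamic_range: float = 255.0) -> float:
    """Single-window SSIM over two equal-size patches (standard constants)."""
    xs = [p for row in a for p in row]
    ys = [p for row in b for p in row]
    mx, my = fmean(xs), fmean(ys)
    vx = fmean([(x - mx) * (x - mx) for x in xs])
    vy = fmean([(y - my) * (y - my) for y in ys])
    cov = fmean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    num = (2 * mx * my + c1) * (2 * cov + c2)
    den = (mx * mx + my * my + c1) * (vx + vy + c2)
    return num / den


def changed_fraction(a: Image, b: Image, tol: float = 10.0) -> float:
    """Fraction of pixels whose absolute difference exceeds tol."""
    total = changed = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            changed += abs(pa - pb) > tol
    return changed / total


def verify_before_dispatch(
    observed: Image,
    current: Image,
    click_xy: tuple[int, int],
    half: int = 8,           # assumed patch half-width in pixels
    ssim_min: float = 0.95,  # assumed L1 threshold
    diff_max: float = 0.02,  # assumed L2a threshold
) -> tuple[bool, str]:
    """Compare the screenshot the agent reasoned over with a fresh one."""
    cx, cy = click_xy
    # L1: local similarity at the click target.
    if ssim(crop(observed, cx, cy, half), crop(current, cx, cy, half)) < ssim_min:
        return False, "L1: click-target region changed"
    # L2a: coarse change detection over the whole screenshot.
    if changed_fraction(observed, current) > diff_max:
        return False, "L2a: global screenshot changed"
    return True, "ok"
```

An agent loop would call `verify_before_dispatch` with the screenshot it planned from and a screenshot captured immediately before the click, and would re-observe rather than act when the check fails. Pixel comparisons alone cannot catch a manipulation that leaves no visual trace, which is consistent with the finding that different primitives need different detection signals.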
