ProjGuard: Safety Monitoring for Computer-Use Agents via Low-Dimensional Projections
Kebin Contreras, Carlos Hinojosa, Jorge Bacca, Bernard Ghanem

TL;DR
ProjGuard monitors the behavior of computer-use agents, detects trajectories that are drifting toward unsafe actions early, and triggers targeted corrections, improving both safety and task completion.
Contribution
The paper presents a safety monitoring method based on low-dimensional projections that reduces unsafe actions and improves task completion in operating-system environments.
Findings
Unsafe rate reduced from 16% to 3% on OS-Harm.
Task completion improved from 59% to 65% on OS-Harm.
The method remains effective when transferred to RiosWorld, reaching a 4% unsafe rate and 64% task completion.
Abstract
Computer-use agents are increasingly capable of operating on real operating systems, but this capability has also increased the risks posed by prompt injection, indirect instructions, and visual attacks. Existing defenses typically rely on analyzing the prompt or each potentially malicious input with a second large model at inference time, which can limit coverage or increase deployment cost. We propose ProjGuard, an alternative based on behavioral trajectory monitoring. At each step, we derive a lightweight scalar risk signal from the agent's accumulated interaction history and evaluate, online, whether execution is beginning to drift toward an unsafe region. This enables early warnings before the trajectory reaches a potentially harmful action. When an alert is raised, we selectively activate an auxiliary vision-language model to propose a corrected next step and steer execution back…
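The monitoring loop described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not taken from the paper: the per-step feature vectors, the fixed projection `direction`, the exponential-moving-average accumulation, the `threshold`, and the `corrector` callback standing in for the auxiliary vision-language model. The sketch only shows the general shape of the idea: project each step to a scalar, accumulate it over the history, and intervene when the running score crosses a threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class RiskMonitor:
    """Hypothetical online monitor: scalar risk from projected step features."""
    direction: Sequence[float]      # assumed projection direction (not from the paper)
    threshold: float = 0.5          # assumed alert threshold
    decay: float = 0.5              # assumed weight on the accumulated history
    score: float = 0.0
    history: List[float] = field(default_factory=list)

    def update(self, step_features: Sequence[float]) -> bool:
        """Fold one step into the running score; return True if an alert fires."""
        if len(step_features) != len(self.direction):
            raise ValueError("feature and direction dimensions differ")
        # Project the step's features onto a single direction (a 1-D projection).
        projected = sum(w * x for w, x in zip(self.direction, step_features))
        # Accumulate over the interaction history with an exponential moving average.
        self.score = self.decay * self.score + (1.0 - self.decay) * projected
        self.history.append(self.score)
        return self.score > self.threshold


def run_episode(
    monitor: RiskMonitor,
    steps: Sequence[Tuple[str, Sequence[float]]],
    corrector: Callable[[str], str],
) -> List[str]:
    """Execute proposed actions, replacing an action only when the monitor alerts."""
    executed = []
    for action, features in steps:
        if monitor.update(features):
            # Stand-in for the auxiliary model proposing a corrected next step.
            action = corrector(action)
        executed.append(action)
    return executed
```

Because the corrector is called only on alert, the cost of the second model is paid on a subset of steps rather than on every input, which is the trade-off the abstract contrasts with per-input screening.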
