Owner-Harm: A Missing Threat Model for AI Agent Safety
Dongcheng Zhang, Yiqing Jiang

TL;DR
This paper introduces Owner-Harm, a formal threat model for AI agents that harm their own deployers. It measures a large detection gap between generic criminal harm and owner harm on existing benchmarks, and proposes layered defenses that raise overall detection.
Contribution
The paper formalizes Owner-Harm as a distinct threat category covering eight kinds of agent behavior that damage the deployer, quantifies the defense gap on two benchmarks (AgentHarm and AgentDojo), and proposes the SSDG framework for improving detection.
Findings
Generic criminal harm is detected reliably: 100% TPR / 0% FPR on AgentHarm
Owner-harm injection tasks are mostly missed: 14.8% detected (4/27) on AgentDojo
Layered defenses improve overall detection (up to 85.3% TPR)
Abstract
Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and a Meta agent's unauthorized forum post exposing operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior damaging the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows the gap is not inherent to…
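The interval quoted in the abstract for the AgentDojo result (4 detections out of 27 tasks) can be reproduced with a standard binomial confidence interval. The sketch below assumes a Wilson score interval with z = 1.96. The abstract does not say which method the authors used, so this is a consistency check rather than a statement of their procedure, and the function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 approximates a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The Wilson interval is centered on a value shrunk toward 0.5.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the abstract: 4 detected out of 27 injection tasks.
low, high = wilson_interval(4, 27)
print(f"{4 / 27:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

With only 27 tasks the interval is wide, so the point estimate of 14.8% should be read as "somewhere between roughly 6% and 33%", which is still far below the 100% TPR reported on AgentHarm.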
