Owner-Harm: A Missing Threat Model for AI Agent Safety

Dongcheng Zhang; Yiqing Jiang

arXiv:2604.18658·cs.CR·April 22, 2026

TL;DR

This paper introduces Owner-Harm, a formal threat model for AI agents that harm their deployers, highlighting a significant defense gap and proposing layered detection methods validated through benchmarks and experiments.

Contribution

The paper formalizes owner-harm as a distinct threat category, quantifies the defense gap on two benchmarks, and proposes the SSDG framework for improving detection across different scenarios.

Findings

01 Perfect detection of generic criminal harm on AgentHarm (100% TPR, 0% FPR)

02 Low detection rate for owner-harm injection tasks on AgentDojo (14.8%, 4 of 27)

03 Layered defenses improve overall detection (up to 85.3% TPR)

Abstract

Existing AI agent safety benchmarks focus on generic criminal harm (cybercrime, harassment, weapon synthesis), leaving a systematic blind spot for a distinct and commercially consequential threat category: agents harming their own deployers. Real-world incidents illustrate the gap: Slack AI credential exfiltration (Aug 2024), Microsoft 365 Copilot calendar-injection leaks (Jan 2024), and an unauthorized forum post by a Meta agent that exposed operational data (Mar 2026). We propose Owner-Harm, a formal threat model with eight categories of agent behavior that damage the deployer. We quantify the defense gap on two benchmarks: a compositional safety system achieves 100% TPR / 0% FPR on AgentHarm (generic criminal harm) yet only 14.8% TPR (4/27; 95% CI: 5.9%-32.5%) on AgentDojo injection tasks (prompt-injection-mediated owner harm). A controlled generic-LLM baseline shows the gap is not inherent to…
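The abstract does not say how the 95% confidence interval for the 4/27 detection rate was computed. As a sanity check, the sketch below assumes a Wilson score interval with z = 1.96, which reproduces the quoted bounds. The function name `wilson_interval` is illustrative and is not taken from the paper.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Assumption: the paper's interval method is not stated in the abstract;
    Wilson is used here because it matches the reported numbers.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    # The interval is centered on a value shrunk toward 0.5, not on p_hat itself.
    center = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)
    ) / denom
    return center - half_width, center + half_width


detected, total = 4, 27  # figures quoted in the abstract
low, high = wilson_interval(detected, total)
print(f"TPR = {detected / total:.1%}, 95% CI = {low:.1%} to {high:.1%}")
# prints: TPR = 14.8%, 95% CI = 5.9% to 32.5%
```

With only 27 injection tasks the interval is wide: the data are consistent with a true detection rate anywhere from roughly 6% to 32%, all far below the 100% TPR reported on AgentHarm.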
