AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

Chenglin Yang

arXiv:2605.04785 · cs.AI · May 7, 2026

TL;DR

AgentTrust is a runtime safety layer for AI agents that intercepts tool calls, providing structured verdicts to prevent unsafe actions like data exfiltration or credential exposure.

Contribution

The paper introduces a runtime safety system that combines shell deobfuscation, multi-step risk detection, and LLM-based judgment, along with a new benchmark for evaluating the safety of AI agents.

Findings

01. Achieves 95-96.7% verdict accuracy on benchmarks.

02. Detects obfuscated payloads with about 93% accuracy.

03. Operates with low latency suitable for real-time safety checks.
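
To make the obfuscation finding concrete, here is a minimal sketch of what a shell deobfuscation step might look like. It is not AgentTrust's normalizer: the function names `normalize` and `looks_dangerous`, the single base64 pattern, and the one-entry deny list are all assumptions chosen for illustration. The idea is only that a command is rewritten into its decoded form before any risk rule inspects it.

```python
import base64
import binascii
import re

# Hypothetical pattern (an assumption, not taken from the paper):
# matches `echo <payload> | base64 -d` or `... | base64 --decode`.
_B64_PIPE = re.compile(
    r"echo\s+['\"]?([A-Za-z0-9+/=]+)['\"]?\s*\|\s*base64\s+(?:-d|--decode)"
)

# Illustrative deny list; a real system would need far more than this.
DANGEROUS_SUBSTRINGS = ("rm -rf /",)


def normalize(command: str) -> str:
    """Replace recognizable base64-decode pipelines with their decoded text."""

    def _decode(match: "re.Match[str]") -> str:
        try:
            return base64.b64decode(match.group(1), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            # Not valid base64 or not text: leave the original span untouched.
            return match.group(0)

    return _B64_PIPE.sub(_decode, command)


def looks_dangerous(command: str) -> bool:
    """Check the normalized command, not the raw one, against the deny list."""
    normalized = normalize(command)
    return any(pattern in normalized for pattern in DANGEROUS_SUBSTRINGS)


if __name__ == "__main__":
    hidden = "echo cm0gLXJmIC8= | base64 -d | sh"
    print(normalize(hidden))
    print(looks_dangerous(hidden))
```

A rule that only scanned the raw string would see nothing suspicious in `hidden`, which is why the normalization step runs first in this sketch.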

Abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a…
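
The interception flow described in the abstract can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the released implementation: the class names `ToolCall`, `Decision`, and `Interceptor`, the two hard-coded rules, and the choice to execute on `warn` while withholding execution on `block` and `review` are all hypothetical. The only elements taken from the source are the four verdicts and the idea of caching judge results for ambiguous inputs.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Optional, Tuple


class Verdict(Enum):
    # The four verdicts named in the abstract.
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass(frozen=True)  # frozen makes instances hashable, so they can key the cache
class ToolCall:
    tool: str
    argument: str


@dataclass
class Decision:
    verdict: Verdict
    reason: str


class Interceptor:
    """Evaluates a tool call before it runs (hypothetical design)."""

    def __init__(self, judge: Callable[[ToolCall], Verdict]) -> None:
        self._judge = judge  # stand-in for an LLM-as-Judge call
        self._cache: Dict[ToolCall, Verdict] = {}
        self.judge_calls = 0

    def evaluate(self, call: ToolCall) -> Decision:
        # Assumed static rules; real rule sets would be much richer.
        if call.tool == "shell" and "rm -rf" in call.argument:
            return Decision(Verdict.BLOCK, "destructive shell command")
        if call.tool == "file_read":
            return Decision(Verdict.ALLOW, "read-only operation")
        # Ambiguous input: consult the judge, reusing a cached verdict if present.
        if call not in self._cache:
            self.judge_calls += 1
            self._cache[call] = self._judge(call)
        return Decision(self._cache[call], "ambiguous input sent to judge")

    def run(
        self, call: ToolCall, execute: Callable[[ToolCall], str]
    ) -> Tuple[Decision, Optional[str]]:
        decision = self.evaluate(call)
        # Assumption: allow and warn proceed; block and review do not execute.
        if decision.verdict in (Verdict.ALLOW, Verdict.WARN):
            return decision, execute(call)
        return decision, None


if __name__ == "__main__":
    interceptor = Interceptor(judge=lambda call: Verdict.REVIEW)
    decision, output = interceptor.run(
        ToolCall("shell", "rm -rf /tmp/build"), lambda call: "executed"
    )
    print(decision.verdict.value, decision.reason, output)
```

Keying the cache on the whole `ToolCall` means an identical repeated call never reaches the judge twice, which is one plausible reading of "cache-aware"; the paper may define it differently.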
