AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use
Chenglin Yang

TL;DR
AgentTrust is a runtime safety layer for AI agents that intercepts tool calls before execution and returns a structured verdict (allow, warn, block, or review) to prevent unsafe actions such as data exfiltration or credential exposure.
Contribution
The paper introduces a runtime safety system that combines shell deobfuscation, multi-step risk detection, and LLM-based judgment, along with a new benchmark for evaluating safety in AI agents.
Findings
Achieves 95% to 96.7% verdict accuracy on benchmarks.
Detects obfuscated payloads with about 93% accuracy.
Operates with low latency suitable for real-time safety checks.
Abstract
Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a…
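
The intercept-then-verdict flow described in the abstract can be illustrated with a minimal sketch. This is not AgentTrust's implementation or API: the names `ToolCall`, `normalize`, and `evaluate`, the regex rules, and the single base64 deobfuscation step are all assumptions made for illustration, and the `review` branch merely stands in for the point where the paper applies its LLM-as-Judge.

```python
import base64
import re
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # The four verdicts named in the abstract.
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass
class ToolCall:
    # Hypothetical representation of an intercepted tool call.
    tool: str       # e.g. "shell", "http", "file"
    argument: str   # raw command, URL, or path


# Hypothetical rules; a real rule set would be far larger.
BLOCK_PATTERNS = [
    re.compile(r"\brm\s+-\w*r\w*\s+/(?:\s|$)"),  # recursive delete of the root directory
    re.compile(r"\.ssh/id_\w+"),                 # access to an SSH private key
]
WARN_PATTERNS = [
    re.compile(r"\bcurl\b.*\|\s*(?:ba)?sh\b"),   # piping a download into a shell
]

# Matches `echo <base64> | base64 -d [| sh]`, one common obfuscation shape.
B64_PIPE = re.compile(
    r"echo\s+([A-Za-z0-9+/=]+)\s*\|\s*base64\s+(?:-d|--decode)(?:\s*\|\s*(?:ba)?sh)?"
)


def normalize(command: str) -> str:
    """Replace base64-piped payloads with their decoded text and collapse whitespace."""
    def decode(match: re.Match) -> str:
        try:
            return base64.b64decode(match.group(1), validate=True).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            return match.group(0)  # leave undecodable payloads untouched
    return " ".join(B64_PIPE.sub(decode, command).split())


def evaluate(call: ToolCall) -> Verdict:
    """Return a verdict for a tool call before it is executed."""
    raw = " ".join(call.argument.split())
    text = normalize(call.argument) if call.tool == "shell" else raw
    if any(p.search(text) for p in BLOCK_PATTERNS):
        return Verdict.BLOCK
    if any(p.search(text) for p in WARN_PATTERNS):
        return Verdict.WARN
    if text != raw:
        # Obfuscation was present but no rule fired: escalate instead of guessing.
        return Verdict.REVIEW
    return Verdict.ALLOW
```

In this sketch an agent runtime would call `evaluate(ToolCall("shell", cmd))` before running `cmd` and proceed only on `Verdict.ALLOW`. Running the rules on the normalized text rather than the raw string is what lets an encoded `rm -rf /` hit the same rule as the plain command.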
