Proof-of-Guardrail in AI Agents and What (Not) to Trust from It
Xisen Jin, Michael Duan, Qin Lin, Aaron Chan, Zhenglun Chen, Junyi Du, Xiang Ren

TL;DR
This paper introduces proof-of-guardrail, a system in which a Trusted Execution Environment (TEE) produces cryptographic proof that an AI agent's response was generated after a specific open-source safety guardrail ran. This makes safety claims verifiable by users while keeping the developer's agent private.
Contribution
It presents a novel system that uses TEE attestation to prove that an AI safety guardrail was actually executed, addressing the risk that developers falsely advertise their safety measures.
Findings
Proof-of-guardrail lets users verify that the advertised guardrail was executed.
The system incurs acceptable latency overhead.
It keeps the developer's agent private while ensuring the integrity of guardrail execution.
Abstract
As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
