Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems
Abhivansh Gupta

TL;DR
This paper introduces a verifiability-first architecture for autonomous LLM agents, combining cryptographic attestations, lightweight audit agents, and challenge protocols to improve controllability and detect misalignment swiftly.
Contribution
It proposes an architecture that integrates cryptographic and symbolic attestations with lightweight verification agents, along with a new benchmark suite for measuring misalignment detection and resilience to adversarial prompt injection.
Findings
Enhanced detection speed of misalignment behaviors
Improved resilience against adversarial prompt injections
Benchmark suite OPERA effectively measures verifiability performance
Abstract
As LLM-based agents grow more autonomous and multi-modal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior benchmarks measured the propensity for misaligned behavior and showed that agent personalities and tool access significantly influence misalignment. Building on these insights, we propose a Verifiability-First architecture that (1) integrates run-time attestations of agent actions using cryptographic and symbolic methods, (2) embeds lightweight Audit Agents that continuously verify intent versus behavior using constrained reasoning, and (3) enforces challenge-response attestation protocols for high-risk operations. We introduce OPERA (Observability, Provable Execution, Red-team, Attestation), a benchmark suite and evaluation protocol designed to measure (i) detectability of misalignment, (ii) time to detection under stealthy…
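As a rough illustration of components (1) and (3) in the abstract, the sketch below shows one way run-time attestation and a challenge-response check could be wired together. It is not the paper's implementation: the shared-key HMAC scheme, the hash-chained log, and every name in it (`AttestationLog`, `verify_log`, `issue_challenge`, `respond`, `check_response`) are assumptions made for the example. A real deployment would more plausibly use asymmetric signatures or hardware-backed keys so that the auditor cannot forge agent records.

```python
import hashlib
import hmac
import json
import secrets

GENESIS = "0" * 64  # placeholder "previous tag" for the first record


def _mac(key: bytes, payload: dict) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class AttestationLog:
    """Append-only, hash-chained log of agent actions (hypothetical design)."""

    def __init__(self, key: bytes):
        self._key = key
        self.records = []

    def append(self, action: str, args: dict) -> dict:
        # Each record commits to the previous record's tag, so edits,
        # deletions, and reorderings break the chain.
        prev = self.records[-1]["tag"] if self.records else GENESIS
        body = {"seq": len(self.records), "action": action, "args": args, "prev": prev}
        record = {**body, "tag": _mac(self._key, body)}
        self.records.append(record)
        return record


def verify_log(key: bytes, records: list) -> bool:
    """Auditor-side check: recompute every tag and walk the chain."""
    prev = GENESIS
    for seq, record in enumerate(records):
        body = {k: v for k, v in record.items() if k != "tag"}
        if body.get("seq") != seq or body.get("prev") != prev:
            return False
        if not hmac.compare_digest(_mac(key, body), record.get("tag", "")):
            return False
        prev = record["tag"]
    return True


def issue_challenge() -> str:
    """Auditor issues a fresh nonce before a high-risk operation."""
    return secrets.token_hex(16)


def respond(key: bytes, nonce: str, action: str, args: dict) -> str:
    """Agent binds the nonce to the exact operation it intends to run."""
    return _mac(key, {"nonce": nonce, "action": action, "args": args})


def check_response(key: bytes, nonce: str, action: str, args: dict, response: str) -> bool:
    """Auditor accepts only a response for this nonce and this operation."""
    expected = _mac(key, {"nonce": nonce, "action": action, "args": args})
    return hmac.compare_digest(expected, response)
```

In this sketch the auditor would call `verify_log` periodically over the agent's records and require a valid `respond` output before approving any high-risk action. Because the nonce is fresh for each challenge, a response captured from an earlier approval cannot be replayed for a later one.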
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Security and Verification in Computing · Advanced Malware Detection Techniques
