Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
Rong Xiang

TL;DR
This paper introduces the PEA architecture, a system-level design that enforces goal integrity in AI agents through separation of powers, cryptographic tokens, and formal verification, addressing agentic misalignment risks.
Contribution
The paper presents a novel system architecture with five core components that structurally enforce safety and goal integrity in autonomous AI agents.
Findings
The PEA architecture enforces goal integrity even under adversarial conditions.
Cryptographically anchored intent tracking ensures traceability of agent actions.
Formal verification proves goal safety under potential system compromises.
Abstract
Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents…
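To make the separation-of-powers idea concrete, the following is a minimal sketch of three layers connected by a signed capability token. It is an illustration under stated assumptions, not the paper's implementation: the HMAC-based token format, the `AUTH_KEY` secret, the `ALLOWED_ACTIONS` allow-list, and all function names are hypothetical, and the abstract does not specify which cryptographic scheme PEA uses.

```python
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical secret shared by the authorization and execution layers only;
# the policy (intent-generating) layer never sees it.
AUTH_KEY = b"demo-key-not-for-production"

# Hypothetical allow-list that the authorization layer checks intents against.
ALLOWED_ACTIONS = {"read_file", "send_report"}


def propose_intent(action: str, target: str) -> dict:
    """Policy layer: proposes an intent but has no means to sign it."""
    return {"action": action, "target": target}


def _sign(intent: dict) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding of the intent."""
    payload = json.dumps(intent, sort_keys=True).encode()
    return hmac.new(AUTH_KEY, payload, hashlib.sha256).hexdigest()


def authorize(intent: dict) -> Optional[dict]:
    """Authorization layer: issues a token only for allow-listed actions."""
    if intent["action"] not in ALLOWED_ACTIONS:
        return None
    return {"intent": intent, "mac": _sign(intent)}


def execute(token: dict) -> str:
    """Execution layer: refuses any request whose token does not verify."""
    if not hmac.compare_digest(_sign(token["intent"]), token["mac"]):
        raise PermissionError("invalid capability token")
    intent = token["intent"]
    return f"executed {intent['action']} on {intent['target']}"
```

In this sketch, an intent the authorization layer rejects never receives a token, and a token forged by the policy layer fails verification at execution time, so the check does not depend on the model behaving well. A real deployment would presumably also need process isolation between layers, token expiry, and replay protection, none of which are modeled here.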
