Discovering Agentic Safety Specifications from 1-Bit Danger Signals
Víctor Gallego

TL;DR
This paper introduces EPO-Safe, a framework enabling large language models to autonomously discover safety constraints from minimal danger signals in structured environments, improving safety reasoning.
Contribution
EPO-Safe demonstrates that LLMs can learn safety specifications from binary danger signals without rich feedback, outperforming reward-based reflection in safety discovery.
Findings
EPO-Safe finds safety constraints within 1-2 rounds in various environments.
Reflection on reward alone can lead to reward hacking, not safety.
Safety specifications remain robust even with 50% noisy danger signals.
Abstract
Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward may diverge from the hidden performance function.…
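The loop described in the abstract (plan, receive one danger bit per timestep, reflect, update the specification) can be sketched roughly as follows. This is a toy illustration, not the author's implementation: every name here (`epo_safe_loop`, `toy_propose_plan`, `toy_reflect`, `toy_is_unsafe`, the action strings) is hypothetical, and the planner and reflector are simple string-based stand-ins for what would be LLM calls in the actual framework.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Action = str


@dataclass
class Episode:
    actions: List[Action]
    danger_bits: List[int]  # 1 = the action at that timestep was flagged unsafe


def run_episode(plan: List[Action], is_unsafe: Callable[[Action], bool]) -> Episode:
    # The environment reveals only one bit per timestep, never the hidden objective.
    return Episode(actions=list(plan), danger_bits=[int(is_unsafe(a)) for a in plan])


def epo_safe_loop(
    propose_plan: Callable[[str], List[Action]],
    reflect: Callable[[str, Episode], str],
    is_unsafe: Callable[[Action], bool],
    rounds: int = 3,
) -> Tuple[str, List[Episode]]:
    spec = ""  # natural-language behavioral specification, initially empty
    history: List[Episode] = []
    for _ in range(rounds):
        plan = propose_plan(spec)            # stand-in for LLM plan generation
        episode = run_episode(plan, is_unsafe)
        history.append(episode)
        spec = reflect(spec, episode)        # stand-in for LLM reflection
    return spec, history


# --- Toy stand-ins (assumptions for illustration only) ---

def toy_propose_plan(spec: str) -> List[Action]:
    candidates = ["step_on_water", "push_box", "move_right"]
    # Skip any action the current specification says to avoid.
    return [a for a in candidates if f"avoid {a}" not in spec]


def toy_reflect(spec: str, episode: Episode) -> str:
    flagged = [a for a, bit in zip(episode.actions, episode.danger_bits) if bit]
    new_rules = [f"avoid {a}" for a in flagged if f"avoid {a}" not in spec]
    return "; ".join(filter(None, [spec, *new_rules]))


def toy_is_unsafe(action: Action) -> bool:
    return action == "step_on_water"


spec, history = epo_safe_loop(toy_propose_plan, toy_reflect, toy_is_unsafe, rounds=2)
print(spec)
print([ep.danger_bits for ep in history])
```

In this toy setting the specification picks up the single hidden constraint after the first round, and the second round's plan triggers no danger bits. A real agent would have to infer why an action was flagged, which the string matching here does not attempt.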
