Discovering Agentic Safety Specifications from 1-Bit Danger Signals

Víctor Gallego

arXiv:2604.23210 · cs.AI · April 28, 2026


TL;DR

This paper introduces EPO-Safe, a framework in which an LLM agent discovers hidden safety constraints in structured, low-dimensional environments from a single binary danger signal per timestep, refining a natural-language behavioral specification through reflection.

Contribution

EPO-Safe demonstrates that LLMs can learn safety specifications from binary danger signals without rich feedback, outperforming reward-based reflection in safety discovery.

Findings

01

EPO-Safe finds safety constraints within 1-2 rounds in various environments.

02

Reflection on reward alone can lead to reward hacking, not safety.

03

Safety specifications remain robust even with 50% noisy danger signals.

Abstract

Can large language model agents discover hidden safety objectives through experience alone? We introduce EPO-Safe (Experiential Prompt Optimization for Safe Agents), a framework where an LLM iteratively generates action plans, receives sparse binary danger warnings, and evolves a natural language behavioral specification through reflection. Unlike standard LLM reflection methods that rely on rich textual feedback (e.g., compiler errors or detailed environment responses), EPO-Safe demonstrates that LLMs can perform safety reasoning from a strictly impoverished signal in structured, low-dimensional environments: the agent never observes the hidden performance function $R^{*}$, only a single bit per timestep indicating that an action was unsafe. We evaluate on five AI Safety Gridworlds (Leike et al., 2017) and five text-based scenario analogs where visible reward $R$ may diverge from $R^{*}$. …
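The plan, warn, reflect cycle described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the LLM is replaced by rule-based stand-ins, and every name here (`epo_safe_loop`, `propose_plan`, `run_episode`, `reflect`, the action names, and the reward values) is hypothetical. The only thing taken from the abstract is the shape of the loop: the agent sees a visible reward and one danger bit per timestep, never the hidden objective.

```python
# Toy sketch of a plan -> danger bits -> reflect loop.
# All names and values are hypothetical; rule-based stubs stand in for the LLM.

VISIBLE_REWARD = {"shortcut": 10, "main_road": 6, "detour": 3}  # what the agent sees
HIDDEN_UNSAFE = {"shortcut"}  # known only to the environment, never to the agent


def run_episode(plan):
    """Environment: return one danger bit per timestep (True = unsafe)."""
    return [action in HIDDEN_UNSAFE for action in plan]


def propose_plan(spec):
    """Stand-in for the LLM planner: pick the highest visible reward
    among actions the current specification does not forbid."""
    allowed = [a for a in VISIBLE_REWARD if f"Avoid '{a}'" not in spec]
    return [max(allowed, key=VISIBLE_REWARD.get)]


def reflect(spec, plan, danger_bits):
    """Stand-in for LLM reflection: append one natural-language rule
    for each action that triggered a danger bit."""
    rules = [f"Avoid '{a}'." for a, bit in zip(plan, danger_bits) if bit]
    return "\n".join(filter(None, [spec, *rules]))


def epo_safe_loop(rounds=3, spec=""):
    """Iterate: plan under the current spec, observe bits, update the spec."""
    history = []
    for _ in range(rounds):
        plan = propose_plan(spec)
        danger_bits = run_episode(plan)
        history.append((plan, danger_bits))
        if any(danger_bits):
            spec = reflect(spec, plan, danger_bits)
    return spec, history


spec, history = epo_safe_loop()
print(spec)
for plan, bits in history:
    print(plan, bits)
```

In this toy setting the reward-greedy planner first picks the action with the highest visible reward, receives a danger bit, and the updated specification steers later rounds to a lower-reward action. A real system would replace both stubs with LLM calls and would need to generalize from the bits rather than blacklist individual actions.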
