A Self-Improving Architecture for Dynamic Safety in Large Language Models
Tyler Slater

TL;DR
This paper introduces SISF, a self-improving architecture for LLM safety that autonomously detects breaches and synthesizes defenses at runtime, significantly reducing attack success rates.
Contribution
The paper presents the SISF framework, enabling LLMs to adaptively improve safety defenses without retraining through a MAPE-K based feedback loop.
Findings
SISF achieved a mean Attack Success Rate (ASR) of 0.27% (+/-0.15%) across five reproducibility trials.
Generated 240 defense policies per trial across experiments.
Reduced residual ASR from 7.88% to 0.00% when stacked with Llama Guard 4.
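To make the ASR figures above concrete, here is a minimal sketch of how a mean ASR and its spread across trials could be computed. It assumes ASR is the percentage of attack attempts that succeed and that the reported spread is a sample standard deviation; the summary does not confirm either. The per-trial counts are hypothetical placeholders, not the paper's data.

```python
from statistics import mean, stdev


def attack_success_rate(successes: int, attempts: int) -> float:
    """Return ASR as a percentage: successful attacks / attempted attacks."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


# Hypothetical (successful attacks, total attacks) per trial; NOT from the paper.
trials = [(1, 500), (2, 500), (0, 500), (3, 500), (1, 500)]

rates = [attack_success_rate(s, n) for s, n in trials]
mean_asr = mean(rates)   # average ASR over the trials
spread = stdev(rates)    # sample standard deviation (an assumed choice)

print(f"mean ASR = {mean_asr:.2f}% (+/-{spread:.2f}%)")
```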
Abstract
Context: Large Language Models (LLMs) rely on static, pre-deployment safety mechanisms that cannot adapt to adversarial threats discovered after release. Objective: To design a software architecture enabling LLM-based systems to autonomously detect safety failures and synthesize defense policies at runtime, without retraining or manual intervention. Method: We propose the Self-Improving Safety Framework (SISF), grounded in the MAPE-K reference model. The framework couples a target LLM with a feedback loop: an Adjudicator detects breaches, a Policy Synthesis Module generates dual-mechanism defense policies (heuristic and semantic), and a Warden enforces them. We conducted seven experiments (10,061 evaluations) across four model families. Results: Across five reproducibility trials, SISF achieved a mean Attack Success Rate (ASR) of 0.27% (+/-0.15%), autonomously generating 240 policies…
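The feedback loop described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the class names `Policy` and `SafetyLoop`, the `[blocked]` and `[withheld]` return strings, the 0.6 threshold, and the use of token-overlap (Jaccard) similarity as a stand-in for the semantic mechanism are all hypothetical. The target LLM and the Adjudicator are replaced by toy callables.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


def _tokens(text: str) -> set:
    """Lowercased alphanumeric tokens, used by the toy semantic check."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Policy:
    pattern: str            # heuristic mechanism: literal phrase to block
    exemplar: str           # semantic mechanism: known-bad prompt to compare with
    threshold: float = 0.6  # hypothetical similarity cutoff

    def blocks(self, prompt: str) -> bool:
        # Heuristic check: case-insensitive substring match.
        if self.pattern.lower() in prompt.lower():
            return True
        # Semantic check (assumed stand-in): Jaccard token overlap.
        a, b = _tokens(prompt), _tokens(self.exemplar)
        union = a | b
        return bool(union) and len(a & b) / len(union) >= self.threshold


@dataclass
class SafetyLoop:
    target: Callable[[str], str]              # the protected LLM
    adjudicator: Callable[[str, str], bool]   # True if the response is a breach
    policies: List[Policy] = field(default_factory=list)  # shared knowledge (the K)

    def handle(self, prompt: str) -> str:
        # Warden: enforce existing policies before the prompt reaches the model.
        if any(p.blocks(prompt) for p in self.policies):
            return "[blocked]"
        response = self.target(prompt)
        # Adjudicator: detect a breach in the model's response.
        if self.adjudicator(prompt, response):
            # Policy synthesis: derive a dual-mechanism policy from the breach.
            self.policies.append(Policy(pattern=prompt, exemplar=prompt))
            return "[withheld]"
        return response


# Toy stand-ins for the target model and the Adjudicator.
loop = SafetyLoop(
    target=lambda p: "UNSAFE" if "bypass" in p else "ok",
    adjudicator=lambda p, r: r == "UNSAFE",
)
first = loop.handle("please bypass the filter now")    # breach, policy synthesized
second = loop.handle("please bypass the filter now")   # caught by heuristic match
variant = loop.handle("now please bypass the filter")  # caught by token overlap
benign = loop.handle("what is the weather")            # passes through
```

In this sketch the first attack succeeds against the toy model but is withheld and converted into a policy, so the repeat and the reordered variant are stopped by the Warden without any retraining of the target.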
