LiSA: Lifelong Safety Adaptation via Conservative Policy Induction
Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le

TL;DR
LiSA is a framework that enhances AI safety guardrails by converting sparse failure feedback into reusable policies, improving robustness and adaptability in real-world deployment environments.
Contribution
LiSA introduces a conservative policy induction method with structured memory and confidence gating to adapt guardrails using limited, noisy feedback.
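To make "confidence gating" concrete, here is a minimal sketch of one way a guardrail could refuse to act on an induced policy until enough feedback supports it. It is an illustration under stated assumptions, not the paper's method: the `CandidatePolicy` record, the Wilson lower bound as the confidence measure, and the 0.6 threshold are all hypothetical choices made for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class CandidatePolicy:
    """Hypothetical memory entry: one induced rule plus its feedback counts."""
    rule: str
    confirmations: int = 0   # feedback reports consistent with the rule
    contradictions: int = 0  # feedback reports that conflict with the rule


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if total == 0:
        return 0.0
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def is_active(policy: CandidatePolicy, threshold: float = 0.6) -> bool:
    """Assumed gate: apply a rule only when its lower bound clears the threshold."""
    total = policy.confirmations + policy.contradictions
    return wilson_lower_bound(policy.confirmations, total) >= threshold


# A rule backed by a single report stays gated; a well-supported one is applied.
weak = CandidatePolicy("never forward payroll files externally", 1, 0)
strong = CandidatePolicy("never forward payroll files externally", 20, 2)
print(is_active(weak), is_active(strong))
```

Gating on a lower confidence bound rather than the raw agreement rate is what makes a sketch like this conservative: a single report, which may itself be mislabeled, is not enough to change the guardrail's behavior.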
Findings
LiSA outperforms memory-based baselines under sparse feedback.
LiSA remains robust under label-flip noise of up to 20%.
LiSA improves the latency-performance trade-off relative to baseline models.
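For readers unfamiliar with label-flip noise, the sketch below shows one common way to simulate it: each feedback label is inverted independently with a fixed probability. The exact noise protocol used in the paper is not given in this summary, so the binary safe/unsafe labels, the `flip_labels` helper, and the fixed seed are assumptions made for illustration.

```python
import random
from typing import List


def flip_labels(labels: List[bool], flip_rate: float, seed: int = 0) -> List[bool]:
    """Return a copy of `labels` where each entry is inverted with probability `flip_rate`.

    Hypothetical helper: assumes binary labels (True = reported unsafe)
    and independent flips, which may differ from the paper's setup.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    return [(not y) if rng.random() < flip_rate else y for y in labels]


clean = [True, False] * 500
noisy = flip_labels(clean, flip_rate=0.20)
observed = sum(a != b for a, b in zip(clean, noisy)) / len(clean)
print(f"fraction of labels flipped: {observed:.3f}")
```

The observed fraction will sit near, but not exactly at, the requested rate, because each flip is an independent random draw.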
Abstract
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a…
