MetaSC: Test-Time Safety Specification Optimization for Language Models
Víctor Gallego

TL;DR
MetaSC introduces a test-time safety optimization framework for language models that adaptively refines safety prompts during inference, enhancing safety performance without altering model weights.
Contribution
It presents a novel meta-critique mechanism that dynamically updates safety specifications at inference time, improving safety across various tasks and adversarial scenarios.
Findings
Significantly higher safety scores with dynamic prompts
Effective against adversarial jailbreak requests
Improves safety in moral and honesty tasks
Abstract
We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts (termed specifications) to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores compared to fixed system prompts and static self-critique defenses. Code is released at https://github.com/vicgalle/meta-self-critique.git.
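The loop described in the abstract (draft, critique, revise, then a meta-critique that rewrites the specification) can be sketched roughly as follows. This is a minimal illustration, not the released implementation. The function names, the prompt wording, the `TASK:` prefixes, and the choice to update the specification once per request are all assumptions made for the example. The `lm` argument is a hypothetical callable that maps a prompt string to a completion string; the linked repository is the place to check how the method is actually implemented.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt to a completion.
LM = Callable[[str], str]


def critique_and_revise(lm: LM, spec: str, request: str, draft: str) -> Tuple[str, str]:
    """Critique a draft against the current specification, then revise it."""
    context = f"Specification:\n{spec}\n\nRequest:\n{request}\n\nResponse:\n{draft}"
    # Prompt wording is illustrative only.
    critique = lm(f"TASK: critique\n{context}\n\nCritique the response against the specification.")
    revised = lm(
        f"TASK: revise\n{context}\n\nCritique:\n{critique}\n\n"
        "Rewrite the response so that it satisfies the specification."
    )
    return critique, revised


def meta_critique(lm: LM, spec: str, request: str, critique: str, revised: str) -> str:
    """Ask the model for an improved specification, given the last interaction."""
    return lm(
        f"TASK: update-spec\nCurrent specification:\n{spec}\n\nRequest:\n{request}\n\n"
        f"Critique:\n{critique}\n\nRevised response:\n{revised}\n\n"
        "Propose an improved specification."
    )


def optimize_specification(lm: LM, initial_spec: str, requests: List[str]) -> Tuple[str, List[str]]:
    """Run the critique/revise loop, updating the specification after each request.

    Model weights are never touched; only the specification text changes.
    """
    spec = initial_spec
    outputs: List[str] = []
    for request in requests:
        draft = lm(request)  # initial, unrevised response
        critique, revised = critique_and_revise(lm, spec, request, draft)
        outputs.append(revised)
        # Assumption: one specification update per request.
        spec = meta_critique(lm, spec, request, critique, revised)
    return spec, outputs
```

With a real model, `lm` would wrap a chat-completion call. The returned specification could then be compared against a fixed system prompt on held-out requests, which is the kind of comparison the abstract reports.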
Taxonomy
Topics: Software Testing and Debugging Techniques · Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research
