MetaSC: Test-Time Safety Specification Optimization for Language Models

Víctor Gallego

arXiv:2502.07985 · cs.CL · April 8, 2025

TL;DR

MetaSC introduces a test-time safety optimization framework for language models that adaptively refines safety prompts during inference, enhancing safety performance without altering model weights.

Contribution

It presents a novel meta-critique mechanism that dynamically updates safety specifications at inference time, improving safety across various tasks and adversarial scenarios.

Findings

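01 Dynamically optimized safety prompts yield significantly higher safety scores than fixed system prompts and static self-critique defenses

02 Improves performance against adversarial jailbreak requests

03 Improves general safety-related tasks, such as avoiding moral harm and pursuing honest responses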

Abstract

We propose a novel dynamic safety framework that optimizes language model (LM) safety reasoning at inference time without modifying model weights. Building on recent advances in self-critique methods, our approach leverages a meta-critique mechanism that iteratively updates safety prompts, termed specifications, to drive the critique and revision process adaptively. This test-time optimization improves performance not only against adversarial jailbreak requests but also on diverse general safety-related tasks, such as avoiding moral harm or pursuing honest responses. Our empirical evaluations across several language models demonstrate that dynamically optimized safety prompts yield significantly higher safety scores than fixed system prompts and static self-critique defenses. Code is released at https://github.com/vicgalle/meta-self-critique.git.
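The loop described in the abstract can be illustrated with a minimal sketch. It is not taken from the released repository: the function names (`self_critique`, `meta_critique`, `run_metasc`), the prompt wording, the bracketed task tags, and the choice to update the specification once per request are all assumptions made for illustration. The language model is abstracted as any callable from a prompt string to a completion string, and `toy_lm` is a deterministic stand-in so the sketch runs without a real model.

```python
from typing import Callable, List, Tuple

# Assumption: a language model is any function from prompt text to completion text.
LM = Callable[[str], str]


def self_critique(lm: LM, spec: str, request: str) -> Tuple[str, str]:
    """One critique-and-revise pass guided by the current specification."""
    draft = lm(f"[respond]\nRequest: {request}")
    critique = lm(
        f"[critique]\nSpecification: {spec}\nRequest: {request}\n"
        f"Response: {draft}\nCritique the response against the specification."
    )
    revised = lm(
        f"[revise]\nSpecification: {spec}\nRequest: {request}\n"
        f"Response: {draft}\nCritique: {critique}\nRewrite the response."
    )
    return critique, revised


def meta_critique(lm: LM, spec: str, request: str, critique: str, revised: str) -> str:
    """Ask the model for an improved specification (hypothetical prompt)."""
    return lm(
        f"[meta]\nSpecification: {spec}\nRequest: {request}\n"
        f"Critique: {critique}\nRevised response: {revised}\n"
        "Propose a better specification for future requests."
    )


def run_metasc(lm: LM, requests: List[str], initial_spec: str) -> Tuple[List[str], str]:
    """Process requests in order, updating only the specification text."""
    spec = initial_spec
    outputs = []
    for request in requests:
        critique, revised = self_critique(lm, spec, request)
        outputs.append(revised)
        # The model weights are never touched; only the prompt-level spec changes.
        spec = meta_critique(lm, spec, request, critique, revised)
    return outputs, spec


def toy_lm(prompt: str) -> str:
    """Deterministic stand-in for a real model: echoes the leading task tag."""
    tag = prompt.split("]", 1)[0].lstrip("[")
    return f"{tag}-output"


if __name__ == "__main__":
    responses, final_spec = run_metasc(toy_lm, ["request A", "request B"], "Be safe.")
    print(responses, final_spec)
```

In this sketch the specification is the only state carried between requests, which mirrors the abstract's claim that optimization happens at inference time without modifying weights. A real setup would replace `toy_lm` with a call to an actual model.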

Code & Models

Repositories

vicgalle/meta-self-critique (Official)

Taxonomy

Topics: Software Testing and Debugging Techniques · Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research