Test-Time Safety Alignment

Baturay Saglam; Dionysis Kalogerias

arXiv:2604.26167·cs.CL·April 30, 2026

TL;DR

This paper shows that optimizing a model's input word embeddings, guided by zeroth-order gradient estimates from a black-box text-moderation API, can reduce the harmfulness of responses from aligned language models.

Contribution

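The paper introduces a method that optimizes input word embeddings for safety using only feedback from a black-box text-moderation API. It targets aligned models, whose responses follow an imbalanced bimodal refuse-or-comply distribution rather than the smooth distribution of open-ended generation.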

Findings

01

The method neutralizes all safety-flagged responses on benchmarks.

02

Input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses.

03

Zeroth-order gradient estimation effectively guides embedding optimization.

Abstract

Recent work has shown that a model's input word embeddings can serve as effective control variables for steering its behavior toward outputs that satisfy desired properties. However, this has only been demonstrated for pretrained text-completion models on the relatively simple objective of reducing surface-level profanity in short continuations. A natural and practically important question is how well input embeddings can control aligned models, which produce an imbalanced bimodal refuse-or-comply output distribution rather than the smooth distribution characteristic of open-ended generation. We explore this in the context of safety, showing that input word embeddings can be optimized in a sub-lexical manner to minimize the semantic harmfulness of aligned model responses. Our approach uses zeroth-order gradient estimation of a black-box text-moderation API with respect to the input…
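
The abstract names zeroth-order gradient estimation of a black-box score with respect to the input, but this excerpt does not say which estimator is used. As a rough illustration of the general idea only, the sketch below uses a standard two-point Gaussian-smoothing estimator, which is an assumption and may differ from the paper's method. The names `zo_gradient`, `toy_harm_score`, and `optimize` are hypothetical, and `toy_harm_score` is a simple quadratic stand-in for the real pipeline (model generation followed by a moderation API call), not an actual harmfulness score.

```python
import numpy as np


def zo_gradient(f, x, num_samples=64, mu=1e-2, rng=None):
    """Estimate the gradient of a black-box scalar function f at x.

    Uses only function evaluations (no backpropagation): for random
    Gaussian directions u, the symmetric finite difference
    (f(x + mu*u) - f(x - mu*u)) / (2*mu) approximates the directional
    derivative along u, and averaging it times u approximates the gradient.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_samples):
        u = rng.standard_normal(x.shape)
        directional = (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu)
        grad += directional * u
    return grad / num_samples


def toy_harm_score(embedding):
    """Hypothetical stand-in for a moderation score (lower is better).

    In the setting the abstract describes, this would involve generating
    a response from the embeddings and scoring it with a black-box API.
    Here it is just the squared distance to an arbitrary target vector.
    """
    target = np.ones_like(embedding)
    return float(np.sum((embedding - target) ** 2))


def optimize(f, x0, steps=200, lr=0.05, seed=0):
    """Plain gradient descent driven by zeroth-order gradient estimates."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    for _ in range(steps):
        x -= lr * zo_gradient(f, x, rng=rng)
    return x


if __name__ == "__main__":
    start = np.zeros(8)  # toy "embedding" with 8 dimensions
    end = optimize(toy_harm_score, start)
    print("score before:", toy_harm_score(start))
    print("score after: ", toy_harm_score(end))
```

Each estimate costs `2 * num_samples` evaluations of the black-box function, which in a real setting would mean that many model generations and API calls per step, so the sample count trades gradient quality against query budget.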
