On Almost Surely Safe Alignment of Large Language Models at Inference-Time

Xiaotong Ji; Shyam Sundhar Ramesh; Matthieu Zimmer; Ilija Bogunovic; Jun Wang; Haitham Bou Ammar

arXiv:2502.01208·cs.LG·June 23, 2025

Open Access

TL;DR

This paper presents InferenceGuard, an inference-time alignment method for large language models that aims to generate safe responses almost surely (with probability approaching one) by modeling response generation as a constrained MDP in the LLM's latent space.

Contribution

The paper casts safe response generation at inference time as a constrained MDP, derives formal safety guarantees from that formulation, and proposes InferenceGuard, a practical safety alignment method that does not modify the model weights.

Findings

01

InferenceGuard effectively balances safety and task performance.

02

It outperforms existing inference-time alignment methods.

03

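The approach provides formal safety guarantees with respect to the given cost model, provided the MDP is solved in the latent space with sufficiently large penalties.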

Abstract

We introduce a novel inference-time alignment approach for LLMs that aims to generate safe responses almost surely, i.e., with probability approaching one. Our approach models the generation of safe responses as a constrained Markov Decision Process (MDP) within the LLM's latent space. We augment a safety state that tracks the evolution of safety constraints and dynamically penalize unsafe generations to ensure the generation of safe responses. Consequently, we demonstrate formal safety guarantees w.r.t. the given cost model upon solving the MDP in the latent space with sufficiently large penalties. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment…
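
The safety-state idea in the abstract can be illustrated with a toy sketch. This is not InferenceGuard itself: the names `AugmentedState`, `step`, `penalized_reward`, and `greedy_decode` are hypothetical, the per-token rewards and costs are made-up numbers, and a simple greedy search over visible tokens stands in for solving the MDP in the latent space. The sketch only assumes that the safety state is a remaining cost budget that shrinks as cost accumulates, and that any continuation driving the budget below zero receives a large penalty.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AugmentedState:
    tokens: tuple   # tokens generated so far
    budget: float   # remaining safety budget (the assumed "safety state")


def step(state, token, cost):
    """Append a token and reduce the remaining budget by that token's cost."""
    return AugmentedState(state.tokens + (token,), state.budget - cost)


def penalized_reward(task_reward, state, penalty=1e6):
    """Return the task reward while the budget is non-negative, else a large penalty."""
    return task_reward if state.budget >= 0 else -penalty


def greedy_decode(candidates_per_step, budget, penalty=1e6):
    """At each step, pick the candidate with the highest penalized reward.

    candidates_per_step: list of lists of (token, task_reward, cost) triples.
    """
    state = AugmentedState((), budget)
    for candidates in candidates_per_step:
        scored = []
        for token, reward, cost in candidates:
            nxt = step(state, token, cost)
            scored.append((penalized_reward(reward, nxt, penalty), nxt))
        # Keep the successor state whose penalized reward is largest.
        state = max(scored, key=lambda pair: pair[0])[1]
    return state


# Hypothetical candidates: (token, task_reward, cost).
steps = [
    [("helpful", 1.0, 0.0), ("risky", 2.0, 0.6)],
    [("detail", 1.0, 0.1), ("harmful", 3.0, 0.6)],
]

tight = greedy_decode(steps, budget=0.5)
loose = greedy_decode(steps, budget=1.0)
print(tight.tokens, loose.tokens)
```

With the tighter budget, the higher-reward "risky" token is rejected because it would exhaust the budget; with the looser budget it is accepted, but "harmful" is still rejected on the second step. The penalty only needs to exceed any achievable task reward for the over-budget candidates to lose, which mirrors the abstract's requirement of "sufficiently large penalties".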

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques