BarrierSteer: LLM Safety via Learning Barrier Steering

Thanh Q. Tran; Arun Verma; Kiwan Wong; Bryan Kian Hsiang Low; Daniela Rus; Wei Xiao

arXiv:2602.20102 · cs.LG · February 24, 2026

Open Access

TL;DR

BarrierSteer introduces a safety framework for large language models that embeds learned safety constraints into the model's latent space, effectively reducing unsafe outputs without altering the original model.

Contribution

It proposes a novel safety mechanism using Control Barrier Functions in latent space, providing theoretical guarantees and practical effectiveness for LLM safety.

Findings

01. Significantly reduces adversarial attack success rates
02. Decreases unsafe content generation
03. Maintains original model performance

Abstract

Despite the state-of-the-art performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a major obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and supported by rigorous theory. We introduce BarrierSteer, a novel framework that formalizes response safety by embedding learned non-linear safety constraints directly into the model's latent representation space. BarrierSteer employs a steering mechanism based on Control Barrier Functions (CBFs) to efficiently detect and prevent unsafe response trajectories during inference with high precision. By enforcing multiple safety constraints through efficient constraint merging, without modifying the underlying LLM parameters, BarrierSteer preserves the…
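
The CBF-based steering idea described in the abstract can be illustrated with a small sketch. The abstract does not give BarrierSteer's exact formulation, so everything here is an assumption made for illustration: the code uses the standard discrete-time CBF condition b(z_new) >= (1 - gamma) * b(z_prev), where b(z) >= 0 marks a latent state as safe, and applies the minimal-norm correction along the barrier gradient under a first-order approximation. The names `cbf_steer`, `barrier`, `barrier_grad`, and `gamma` are hypothetical, and the hand-written affine barrier stands in for the learned non-linear constraints the paper describes.

```python
import numpy as np


def cbf_steer(z_prev, z_next, barrier, barrier_grad, gamma=0.5, eps=1e-12):
    """Minimally shift z_next so a linearized discrete-time CBF condition holds.

    Condition (assumed, standard form): barrier(z_new) >= (1 - gamma) * barrier(z_prev),
    where barrier(z) >= 0 means "safe". This is a sketch, not the paper's method.
    """
    target = (1.0 - gamma) * barrier(z_prev)
    gap = target - barrier(z_next)
    if gap <= 0.0:
        # The proposed latent already satisfies the condition: leave it untouched.
        return z_next
    g = barrier_grad(z_next)
    g_sq = float(g @ g)
    if g_sq < eps:
        # Degenerate gradient: no usable correction direction in this sketch.
        return z_next
    # Smallest Euclidean shift that closes the gap to first order
    # (exact when the barrier is affine, approximate otherwise).
    return z_next + (gap / g_sq) * g


# Toy affine barrier (an assumption): safe region is z[0] >= -1.
w = np.array([1.0, 0.0])
c = 1.0


def barrier(z):
    return float(w @ z + c)


def barrier_grad(z):
    return w


z_prev = np.array([1.0, 0.0])        # barrier = 2.0 (safe)
z_unsafe = np.array([-3.0, 2.0])     # barrier = -2.0 (unsafe proposal)
z_steered = cbf_steer(z_prev, z_unsafe, barrier, barrier_grad, gamma=0.5)

z_safe = np.array([2.0, 5.0])        # barrier = 3.0, already satisfies the condition
z_unchanged = cbf_steer(z_prev, z_safe, barrier, barrier_grad, gamma=0.5)
```

In this toy case the unsafe proposal is moved only along the barrier gradient, so its second coordinate is preserved, while a proposal that already meets the condition passes through unchanged. That mirrors the abstract's claim that steering intervenes on unsafe trajectories without modifying the underlying model parameters. With a non-linear barrier the single linearized step is only approximate and may need to be repeated.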

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Domain Adaptation and Few-Shot Learning