Learning Safety Constraints for Large Language Models

Xin Chen; Yarden As; Andreas Krause

arXiv:2505.24445·cs.LG·June 2, 2025

TL;DR

SaP (Safety Polytope) is a geometric method that enforces safety constraints in large language models by learning a safety polytope in the representation space, enabling detection and correction of unsafe outputs without altering the model weights.

Contribution

The paper presents SaP, a novel post-hoc safety framework that learns a safety polytope in the representation space to improve LLM safety and interpretability.

Findings

1. Effectively detects unethical inputs
2. Reduces adversarial attack success rates
3. Maintains performance on standard tasks

Abstract

Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP, short for Safety Polytope, a geometric approach to LLM safety that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs and reduce adversarial attack success rates while maintaining performance on standard tasks, thus highlighting the…
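To make the geometric idea in the abstract concrete, here is a minimal sketch in plain Python. It assumes the safe region is a polytope {h : a_i · h <= b_i for every facet i} over a representation vector h, flags a representation as unsafe when any facet is violated, and steers it back by cyclically projecting onto the violated half-spaces. The function names, the facet values, and the cyclic-projection routine are illustrative assumptions, not the paper's learned constraints or its actual steering procedure.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def facet_violations(h, normals, offsets) -> List[float]:
    """Per-facet violation max(0, a_i . h - b_i); all zeros means h is inside."""
    return [max(0.0, dot(a, h) - b) for a, b in zip(normals, offsets)]


def is_safe(h, normals, offsets, tol: float = 1e-9) -> bool:
    """Detection: h is 'safe' if no facet is violated by more than tol."""
    return all(v <= tol for v in facet_violations(h, normals, offsets))


def steer(h, normals, offsets, max_iters: int = 100, tol: float = 1e-9) -> List[float]:
    """Correction (assumed stand-in): cyclic projection onto violated half-spaces.

    This finds a point in the polytope when one exists, but not necessarily
    the closest one to the original h.
    """
    h = list(h)
    for _ in range(max_iters):
        moved = False
        for a, b in zip(normals, offsets):
            margin = dot(a, h) - b
            if margin > tol:
                # Move h along -a just far enough to land on the facet.
                step = margin / dot(a, a)
                h = [x - step * ai for x, ai in zip(h, a)]
                moved = True
        if not moved:
            break
    return h


# Hypothetical 2-D polytope: x <= 1 and y <= 1.
normals = [[1.0, 0.0], [0.0, 1.0]]
offsets = [1.0, 1.0]
h_unsafe = [2.0, 0.5]

steered = steer(h_unsafe, normals, offsets)
print(is_safe(h_unsafe, normals, offsets), steered)
```

In this toy case only the first facet is violated, so the steered point lies on that facet while the second coordinate is left unchanged. In a real setting the representation would be a hidden state read from the model and the facets would be learned from data; both are outside the scope of this sketch.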

Code & Models

Repositories

lasgroup/safetypolytope (PyTorch, official)

Taxonomy

Topics: Topic Modeling