SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Somnath Banerjee, Sayan Layek, Soham Tripathy, Shanu Kumar, Animesh Mukherjee, Rima Hazra

arXiv:2406.12274·cs.CL·December 17, 2024

TL;DR

SafeInfer is a context-adaptive, decoding-time safety alignment method for large language models: it uses safe demonstration examples to adjust hidden states, then guides token selection toward safer outputs. The paper also introduces HarmEval, a new safety evaluation benchmark.

Contribution

It proposes a context-adaptive safety alignment strategy with two phases and introduces HarmEval, a comprehensive safety evaluation benchmark for large language models.

Findings

01. Improved safety in language model outputs using SafeInfer
02. Effective safety evaluation with the HarmEval benchmark
03. Enhanced robustness of safety mechanisms against model editing

Abstract

Safety-aligned language models often exhibit fragile and imbalanced safety mechanisms, increasing the likelihood of generating unsafe content. In addition, incorporating new knowledge through editing techniques to language models can further compromise safety. To address these issues, we propose SafeInfer, a context-adaptive, decoding-time safety alignment strategy for generating safe responses to user queries. SafeInfer comprises two phases: the safety amplification phase, which employs safe demonstration examples to adjust the model's hidden states and increase the likelihood of safer outputs, and the safety-guided decoding phase, which influences token selection based on safety-optimized distributions, ensuring the generated content complies with ethical guidelines. Further, we present HarmEval, a novel benchmark for extensive safety evaluations, designed to address potential misuse…
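
The two phases named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean-of-demonstrations steering vector, the linear mixing rule, and the `strength` and `weight` parameters are all assumptions chosen for illustration, and plain Python lists stand in for real model tensors.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def amplify_hidden_state(hidden, safe_demo_states, strength=1.0):
    """Toy 'safety amplification' (assumed form, not from the paper).

    Adds a scaled steering vector, here the mean of the hidden states
    produced by safe demonstration examples, to a hidden state.
    """
    steer = mean_vector(safe_demo_states)
    return [h + strength * s for h, s in zip(hidden, steer)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def safety_guided_step(base_logits, safe_logits, weight=0.5):
    """Toy 'safety-guided decoding' step (assumed form, not from the paper).

    Mixes the base next-token distribution with a safety-optimized one
    and greedily picks the most probable token from the mixture.
    Returns (token_index, mixed_distribution).
    """
    base = softmax(base_logits)
    safe = softmax(safe_logits)
    mixed = [(1 - weight) * b + weight * s for b, s in zip(base, safe)]
    token = max(range(len(mixed)), key=mixed.__getitem__)
    return token, mixed


if __name__ == "__main__":
    # Hypothetical 2-dimensional hidden states and a 2-token vocabulary.
    shifted = amplify_hidden_state([1.0, 2.0], [[1.0, 0.0], [3.0, 2.0]], strength=0.5)
    token, dist = safety_guided_step([2.0, 0.0], [0.0, 2.0], weight=0.9)
    print(shifted, token, dist)
```

In this sketch, `weight=0.0` reproduces the base model's greedy choice and `weight=1.0` follows the safety-optimized distribution alone; how the paper actually combines the distributions is not stated in the excerpt above.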

Code & Models

Repositories

neuralsentinel/safeinfer
PyTorch · Official

Datasets

SoftMINER-Group/TechHazardQA
dataset · 57 downloads

Videos

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification