Distributional Surgery for Language Model Activations

Bao Nguyen; Binh Nguyen; Duy Nguyen; Viet Anh Nguyen

arXiv:2501.15758 · cs.LG · November 11, 2025

TL;DR

This paper introduces a two-stage method to detect and mitigate undesirable content in language models by rectifying activations through distributional steering, improving safety without significantly altering model behavior.

Contribution

It proposes a novel ensemble of classifiers for detection and a distributional steering approach via semidefinite programming for mitigation, advancing safety in language model outputs.
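
The detection stage can be illustrated with a minimal sketch, assuming plain logistic probes combined by majority vote. The paper's risk-aware score and its smooth surrogate are not given on this page, so a class-weighted logistic loss stands in for them here. The names `train_probe`, `ensemble_detect`, and `fn_weight`, as well as the synthetic activations, are hypothetical.

```python
import numpy as np

def train_probe(X, y, fn_weight=2.0, lr=0.1, steps=500):
    """Fit a logistic probe on one layer's activations by gradient descent.

    fn_weight > 1 penalises missed undesirable samples (y == 1) more than
    false alarms. This is an assumed stand-in for a risk-aware objective.
    """
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    sample_w = np.where(y == 1, fn_weight, 1.0)
    for _ in range(steps):
        z = np.clip(X @ w + b, -30.0, 30.0)   # avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        g = sample_w * (p - y)                # weighted logistic-loss gradient
        w -= lr * (X.T @ g) / n
        b -= lr * g.mean()
    return w, b

def ensemble_detect(acts_per_layer, probes, min_votes=None):
    """Flag a sample when at least min_votes layerwise probes fire."""
    votes = np.stack([
        (X @ w + b > 0).astype(int)
        for X, (w, b) in zip(acts_per_layer, probes)
    ])
    if min_votes is None:
        min_votes = len(probes) // 2 + 1      # simple majority
    return votes.sum(axis=0) >= min_votes

# Synthetic "activations": three layers, two well-separated classes.
rng = np.random.default_rng(0)
n, d, n_layers = 200, 8, 3
y = rng.integers(0, 2, size=n)
acts = [
    rng.normal(size=(n, d)) + np.where(y[:, None] == 1, 1.0, -1.0)
    for _ in range(n_layers)
]
probes = [train_probe(X, y) for X in acts]
flags = ensemble_detect(acts, probes)
accuracy = (flags == (y == 1)).mean()
```

Raising `fn_weight` or lowering `min_votes` trades more false alarms for fewer missed detections; which trade-off the authors target is not stated here.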

Findings

01. Outperforms baselines in reducing harmful outputs
02. Effective across multiple language models and datasets
03. Minimally perturbs attention distributions

Abstract

Language models, while capable of generating remarkably coherent and seemingly accurate text, can occasionally produce undesirable content, including harmful or toxic outputs. In this paper, we present a new two-stage approach to detect and mitigate undesirable generations by rectifying activations. First, we train an ensemble of layerwise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for detected undesirable content, we propose layerwise distributional steering policies that transform the attention heads. These policies are computed through principled semidefinite programming, which aims to minimally perturb the attention distribution while probabilistically guaranteeing the effectiveness of the edits. Empirical evaluations across multiple language models and datasets show that our method…
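
To make the idea of distributional steering concrete, here is a rough sketch that moves flagged activations toward the distribution of desirable ones with an affine map. It is not the paper's semidefinite program and carries no probabilistic guarantee: it swaps in the closed-form optimal transport map between two Gaussians fitted to the data. The function names, the `eps` regulariser, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def sqrtm_psd(S):
    """Square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)           # guard tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_transport_map(X_src, X_tgt, eps=1e-6):
    """Affine map x -> m2 + A (x - m1) between fitted Gaussians.

    A = S1^{-1/2} (S1^{1/2} S2 S1^{1/2})^{1/2} S1^{-1/2}, so the mapped
    source samples match the target mean and covariance.
    """
    d = X_src.shape[1]
    m1, m2 = X_src.mean(axis=0), X_tgt.mean(axis=0)
    S1 = np.cov(X_src, rowvar=False) + eps * np.eye(d)
    S2 = np.cov(X_tgt, rowvar=False) + eps * np.eye(d)
    S1_half = sqrtm_psd(S1)
    S1_half_inv = np.linalg.inv(S1_half)
    A = S1_half_inv @ sqrtm_psd(S1_half @ S2 @ S1_half) @ S1_half_inv

    def steer(x):
        # A is symmetric, so right-multiplying row vectors by A.T applies A.
        return m2 + (x - m1) @ A.T

    return steer

# Synthetic head activations for the two groups (hypothetical data).
rng = np.random.default_rng(0)
undesirable = rng.normal(loc=2.0, scale=1.5, size=(500, 4))
desirable = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

steer = gaussian_transport_map(undesirable, desirable)
steered = steer(undesirable)                  # apply only to flagged samples
```

In a two-stage pipeline such a map would be applied only to activations the detector flags, leaving all other inputs untouched.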

Code & Models

Repositories

nguyenngocbaocmt02/OT-Intervention
PyTorch · Official

Datasets

baonn/nqopen
dataset · 59 dl

Videos

Distributional Surgery for Language Model Activations · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need