GRP-Obliteration: Unaligning LLMs With a Single Unlabeled Prompt

Mark Russinovich; Yanan Cai; Keegan Hines; Giorgio Severi; Blake Bullwinkel; Ahmed Salem

arXiv:2602.06258 · cs.LG · February 9, 2026

Open Access

TL;DR

This paper introduces GRP-Obliteration, a GRPO-based method that unaligns safety-aligned models using a single unlabeled prompt. It largely preserves model utility, achieves stronger unalignment on average than existing techniques, and extends to diffusion-based image generation systems.

Contribution

GRP-Obliteration is the first approach to reliably unalign models with a single unlabeled prompt. It extends unalignment to diffusion models and achieves stronger unalignment on average than prior methods.

Findings

01. A single unlabeled prompt is sufficient to unalign safety-aligned models

02. Model utility is largely preserved after unalignment

03. The method generalizes to diffusion-based image generation systems

Abstract

Safety alignment is only as robust as its weakest failure mode. Despite extensive work on safety post-training, it has been shown that models can be readily unaligned through post-deployment fine-tuning. However, these methods often require extensive data curation and degrade model utility. In this work, we extend the practical limits of unalignment by introducing GRP-Obliteration (GRP-Oblit), a method that uses Group Relative Policy Optimization (GRPO) to directly remove safety constraints from target models. We show that a single unlabeled prompt is sufficient to reliably unalign safety-aligned models while largely preserving their utility, and that GRP-Oblit achieves stronger unalignment on average than existing state-of-the-art techniques. Moreover, GRP-Oblit generalizes beyond language models and can also unalign diffusion-based image generation systems. We evaluate GRP-Oblit…
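The abstract names Group Relative Policy Optimization (GRPO) as the underlying optimizer. For readers unfamiliar with it, the sketch below illustrates only the generic group-relative advantage step commonly associated with GRPO: several completions are sampled for one prompt, each gets a scalar reward, and each reward is normalized against its own group. It is not the paper's implementation. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are all illustrative assumptions, and nothing here reflects how GRP-Oblit defines or computes its rewards.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar per completion sampled for the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Placeholder rewards for four completions of one prompt (not from the paper).
example_rewards = [0.0, 1.0, 1.0, 2.0]
advantages = group_relative_advantages(example_rewards)
```

Completions that score above their group's mean receive positive advantages and those below receive negative ones, so no separately trained value model is needed to form a baseline.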

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Software Testing and Debugging Techniques