BadGPT-4o: stripping safety finetuning from GPT models

Ekaterina Krupkina; Dmitrii Volkov

arXiv:2412.05346·cs.CR·December 10, 2024

Open Access

TL;DR

This paper presents BadGPT-4o, a simple fine-tuning poisoning attack that strips GPT-4o's safety guardrails without degrading the model. It matches the best white-box jailbreaks on HarmBench and StrongREJECT and remains easy to execute.

Contribution

It shows that a version of Qi et al. 2023's simple fine-tuning poisoning technique strips the safety guardrails from GPT-4o without degrading the model.

Findings

01

BadGPT-4o matches top white-box jailbreaks on HarmBench and StrongREJECT.

02

The attack incurs none of the token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations.

03

The attack remains easy to execute despite having been known for a year.

Abstract

We show that a version of Qi et al. 2023's simple fine-tuning poisoning technique strips GPT-4o's safety guardrails without degrading the model. The BadGPT attack matches the best white-box jailbreaks on HarmBench and StrongREJECT. It suffers none of the token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.

Taxonomy

Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification · Software Reliability and Analysis Research