BadGPT-4o: stripping safety finetuning from GPT models
Ekaterina Krupkina, Dmitrii Volkov

TL;DR
This paper presents BadGPT-4o, a simple fine-tuning attack that removes GPT-4o's safety features without affecting its performance, matching advanced jailbreaks and remaining easy to execute.
Contribution
It shows that a version of Qi et al. 2023's straightforward fine-tuning poisoning technique strips the safety guardrails from GPT-4o without degrading the model's performance.
Findings
BadGPT-4o matches top white-box jailbreaks on HarmBench and StrongREJECT.
The attack incurs none of the token overhead or performance loss common to jailbreaks, as evaluated on tinyMMLU and open-ended generations.
It remains easy to execute despite being known for a year.
Abstract
We show that a version of Qi et al. 2023's simple fine-tuning poisoning technique strips GPT-4o's safety guardrails without degrading the model. The BadGPT attack matches the best white-box jailbreaks on HarmBench and StrongREJECT. It suffers none of the token overhead or performance hits common to jailbreaks, as evaluated on tinyMMLU and open-ended generations. Despite having been known for a year, this attack remains easy to execute.
Taxonomy
Topics: Anomaly Detection Techniques and Applications · Machine Learning and Data Classification · Software Reliability and Analysis Research
