Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks
Jack Youstra, Mohammed Mahfoud, Yang Yan, Henry Sleight, Ethan Perez, Mrinank Sharma

TL;DR
This paper introduces CIFR, a benchmark for evaluating defenses against cipher-based attacks on LLM fine-tuning APIs, and shows that probe monitors detect these attacks with over 99% accuracy and generalize to unseen ciphers.
Contribution
The paper formalizes the fine-tuning API defense problem, introduces the CIFR benchmark with diverse cipher encodings, and evaluates defenses on it, with probe monitors achieving over 99% detection accuracy.
Findings
Probe monitors achieve over 99% detection accuracy.
Monitors generalize well to unseen cipher variants and families.
CIFR and code are open-sourced for further research.
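To make "unseen cipher variants and families" concrete, here is a minimal sketch of a held-out split. The family names, variant labels, and the `split_examples` helper are hypothetical and are not taken from CIFR; the only idea carried over from the paper is that some ciphers appear exclusively in the test set.

```python
# Illustrative sketch only: family/variant names and this helper are assumptions,
# not the CIFR benchmark's actual data layout or API.

def split_examples(examples, held_out_families, held_out_variants):
    """Send an example to the test set if its whole family is held out,
    or if its specific (family, variant) pair is held out."""
    train, test = [], []
    for ex in examples:
        unseen = (
            ex["family"] in held_out_families
            or (ex["family"], ex["variant"]) in held_out_variants
        )
        (test if unseen else train).append(ex)
    return train, test


# Hypothetical toy dataset: each item records which cipher produced it.
examples = [
    {"family": "shift", "variant": "key=3", "text": "..."},
    {"family": "shift", "variant": "key=7", "text": "..."},
    {"family": "substitution", "variant": "seed=0", "text": "..."},
    {"family": "transposition", "variant": "width=4", "text": "..."},
]

train, test = split_examples(
    examples,
    held_out_families={"transposition"},     # unseen family
    held_out_variants={("shift", "key=7")},  # unseen variant of a seen family
)
```

A monitor trained on `train` and scored on `test` is then measured on ciphers it never saw, which is the kind of generalization the findings describe.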
Abstract
Large language model fine-tuning APIs enable widespread model customization, yet pose significant safety risks. Recent work shows that adversaries can exploit access to these APIs to bypass model safety mechanisms by encoding harmful content in seemingly harmless fine-tuning data, evading both human monitoring and standard content filters. We formalize the fine-tuning API defense problem, and introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety in the face of cipher-enabled attackers while achieving the desired level of fine-tuning functionality. We include diverse cipher encodings and families, with some kept exclusively in the test set to evaluate for generalization across unseen ciphers and cipher families. We then evaluate different defenses on the benchmark and train probe monitors on model…
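The idea of encoded content evading standard content filters can be illustrated with a toy example. The sketch below uses a Caesar shift and a naive keyword filter; both are assumptions chosen for brevity, and neither is claimed to be among the ciphers or defenses studied in the paper. The plaintext is deliberately benign.

```python
import string

ALPHABET = string.ascii_lowercase


def caesar(text: str, shift: int) -> str:
    """Shift each lowercase letter by `shift` positions; leave other characters alone."""
    out = []
    for ch in text.lower():
        if ch in ALPHABET:
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            out.append(ch)
    return "".join(out)


def keyword_filter(text: str, blocklist: list[str]) -> bool:
    """Naive content filter (hypothetical): flag text containing any blocked word."""
    lowered = text.lower()
    return any(word in lowered for word in blocklist)


plaintext = "please explain the secret recipe"
blocklist = ["secret"]

encoded = caesar(plaintext, 3)   # what would be placed in fine-tuning data
decoded = caesar(encoded, -3)    # what a model taught the cipher could recover

flagged_plain = keyword_filter(plaintext, blocklist)
flagged_encoded = keyword_filter(encoded, blocklist)
```

Here the filter flags the plaintext but not the encoded string, even though the encoding is trivially reversible. This gap is what motivates monitors that do not rely on surface text alone.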
