Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks

Jack Youstra; Mohammed Mahfoud; Yang Yan; Henry Sleight; Ethan Perez; Mrinank Sharma

arXiv:2508.17158 · cs.LG · August 26, 2025

TL;DR

This paper introduces CIFR, a benchmark for evaluating defenses against cipher-based attacks on LLM fine-tuning APIs, and shows that probe monitors detect such attacks with over 99% accuracy and generalize to unseen cipher variants and families.

Contribution

The paper formalizes the fine-tuning API defense problem, introduces the CIFR benchmark with diverse cipher encodings, and evaluates defenses on it, with probe monitors achieving over 99% detection accuracy.

Findings

01. Probe monitors achieve over 99% detection accuracy.

02. Monitors generalize well to unseen cipher variants and families.

03. CIFR and code are open-sourced for further research.
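
To make the idea of a probe monitor concrete, here is a minimal sketch of a linear probe: a logistic-regression classifier over fixed-length feature vectors. It is only an illustration under stated assumptions. The feature vectors are synthetic stand-ins, the helper names (`make_synthetic`, `train_probe`, `accuracy`) are hypothetical, and nothing here reproduces the paper's probes, data, or results.

```python
import math
import random

def make_synthetic(n, dim, sep, seed):
    """Synthetic stand-in features (an assumption, not real model data).

    Label 1 vectors are centered at +sep on every axis, label 0 at -sep.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for i in range(n):
        label = i % 2
        center = sep if label else -sep
        xs.append([rng.gauss(center, 1.0) for _ in range(dim)])
        ys.append(label)
    return xs, ys

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(xs, ys, epochs=50, lr=0.1):
    """Fit weights w and bias b by full-batch gradient descent on log loss."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples whose thresholded logit matches the label."""
    hits = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        hits += pred == y
    return hits / len(ys)

train_x, train_y = make_synthetic(200, 8, 1.5, seed=0)
test_x, test_y = make_synthetic(200, 8, 1.5, seed=1)
w, b = train_probe(train_x, train_y)
print(f"held-out accuracy on synthetic data: {accuracy(w, b, test_x, test_y):.3f}")
```

The printed accuracy describes only this toy, well-separated data and says nothing about performance on the benchmark itself.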

Abstract

Large language model fine-tuning APIs enable widespread model customization, yet pose significant safety risks. Recent work shows that adversaries can exploit access to these APIs to bypass model safety mechanisms by encoding harmful content in seemingly harmless fine-tuning data, evading both human monitoring and standard content filters. We formalize the fine-tuning API defense problem, and introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety in the face of cipher-enabled attackers while achieving the desired level of fine-tuning functionality. We include diverse cipher encodings and families, with some kept exclusively in the test set to evaluate for generalization across unseen ciphers and cipher families. We then evaluate different defenses on the benchmark and train probe monitors on model…
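
The held-out design described in the abstract, where some ciphers and whole cipher families appear only in the test set, can be sketched as a split over (family, variant) pairs. This is a minimal sketch under assumptions: the family and variant names below are hypothetical and not taken from CIFR, and `split_ciphers` is an illustrative helper, not part of the released code.

```python
import random

# Hypothetical cipher families and variants, for illustration only;
# these names are assumptions and are not drawn from the CIFR benchmark.
CIPHERS = {
    "substitution": ["caesar_3", "caesar_7", "atbash"],
    "transposition": ["reverse", "rail_fence_2"],
    "encoding": ["base64", "hex"],
    "wordplay": ["pig_latin"],
}

def split_ciphers(ciphers, held_out_families, held_out_per_family=1, seed=0):
    """Split (family, variant) pairs into three disjoint groups.

    - train: variants a monitor may see during training
    - test_unseen_variant: held-out variants from families seen in training
    - test_unseen_family: every variant of a family never seen in training
    """
    rng = random.Random(seed)
    train, test_unseen_variant, test_unseen_family = [], [], []
    for family in sorted(ciphers):
        variants = sorted(ciphers[family])
        if family in held_out_families:
            test_unseen_family.extend((family, v) for v in variants)
            continue
        rng.shuffle(variants)
        # Always leave at least one variant of a seen family in train.
        k = min(held_out_per_family, len(variants) - 1)
        test_unseen_variant.extend((family, v) for v in variants[:k])
        train.extend((family, v) for v in variants[k:])
    return train, test_unseen_variant, test_unseen_family

train, unseen_variant, unseen_family = split_ciphers(CIPHERS, {"wordplay"})
print("train:", train)
print("unseen variants:", unseen_variant)
print("unseen families:", unseen_family)
```

Keeping the two test groups separate lets generalization to new variants of known families be reported apart from generalization to entirely new families.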
