Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies; Eric Winsor; Alexandra Souly; Tomek Korbak; Robert Kirk; Christian Schroeder de Witt; Yarin Gal

arXiv:2502.14828 · cs.LG · October 27, 2025

Open Access

TL;DR

This paper demonstrates that pointwise detection defences for fine-tuning APIs are fundamentally limited: attackers can covertly transmit harmful knowledge using individually benign samples that evade existing safeguards.

Contribution

The work introduces 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to bypass fine-tuning defences, revealing fundamental limitations of defences that inspect training or inference samples one at a time.

Findings

01. Attacks successfully elicited harmful responses via the OpenAI fine-tuning API.

02. The proposed attacks evade enhanced monitoring systems.

03. The results demonstrate the need for new defence strategies.

Abstract

LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our…
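The idea of repurposing benign output variation as a covert channel can be illustrated with a toy sketch. This is not the authors' implementation: the four-way label set, the phrasings, and the names `BENIGN_VARIANTS`, `encode`, `decode`, and `pointwise_flag` are all hypothetical, and the keyword check is only a stand-in for a per-sample detector. The sketch shows why a detector that inspects samples individually sees nothing suspicious, even though the choice among benign phrasings carries information.

```python
# Toy illustration only. All names and phrasings are hypothetical assumptions,
# not taken from the paper.

# Each label is assigned one of several equally benign phrasings.
BENIGN_VARIANTS = {
    "A": "Sure, happy to help with that.",
    "B": "Of course, here is what I found.",
    "C": "Certainly, let me take a look.",
    "D": "Absolutely, here you go.",
}

# Inverse mapping, known only to whoever chose the encoding.
DECODE = {text: label for label, text in BENIGN_VARIANTS.items()}


def encode(label: str) -> str:
    """Return the benign phrasing assigned to a label."""
    return BENIGN_VARIANTS[label]


def decode(text: str) -> str:
    """Recover the label from the phrasing that was chosen."""
    return DECODE[text]


def pointwise_flag(sample: str, blocklist=("dangerous", "weapon")) -> bool:
    """Stand-in per-sample detector: judges a sample by its own content only."""
    lowered = sample.lower()
    return any(term in lowered for term in blocklist)


labels = ["C", "A", "D", "B"]
samples = [encode(label) for label in labels]

# Every sample passes the per-sample check on its own...
flags = [pointwise_flag(sample) for sample in samples]

# ...yet the sequence of phrasing choices still carries the labels.
recovered = [decode(sample) for sample in samples]
```

In this sketch the information lives in the mapping between labels and phrasings rather than in any single sample, which is why inspecting samples one at a time cannot reveal it.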

Taxonomy

Topics: Digital Rights Management and Security