Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims

Phongsakon Mark Konrad; Toygar Tanyel; Serkan Ayvaz

arXiv:2605.10575 · cs.CR · May 12, 2026


TL;DR

This paper introduces Acceptance Cards, an evaluation protocol for safe fine-tuning defenses. A held-out gap reduction counts as a full-card pass only after checks for statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer.

Contribution

It proposes a standardized, executable audit framework that tests fine-tuning defense claims against four diagnostics instead of accepting a held-out gap reduction on its own.

Findings

01

SafeLoRA fails the Acceptance Card diagnostics on the Gemma-2-2B-it model.

02

Most model families do not satisfy all strict diagnostic criteria.

03

The protocol reveals limitations in current fine-tuning defense evaluations.

Abstract

Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of…
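The gating logic described in the abstract can be illustrated with a minimal sketch. It assumes each diagnostic reduces to a single pass/fail outcome. The class name `AcceptanceCard`, its field names, and its methods are hypothetical, paraphrased from the abstract, and are not taken from the authors' audit package.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AcceptanceCard:
    """Hypothetical record of the four diagnostics named in the abstract."""

    statistical_reliability: bool
    fresh_semantic_generalization: bool
    mechanism_alignment: bool
    cross_task_transfer: bool

    def failed(self) -> list[str]:
        # Names of the diagnostics that did not pass.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def full_card_pass(self) -> bool:
        # A gap reduction is treated as a full-card pass only if every diagnostic passes.
        return not self.failed()


# The abstract reports that SafeLoRA on Gemma-2-2B-it fails all four
# diagnostics under strict mechanism-class coding.
strict_card = AcceptanceCard(
    statistical_reliability=False,
    fresh_semantic_generalization=False,
    mechanism_alignment=False,
    cross_task_transfer=False,
)
print(strict_card.full_card_pass(), len(strict_card.failed()))
```

Under this sketch a single failed diagnostic is enough to block a full-card pass, which matches the abstract's framing that a gap reduction alone is not sufficient evidence.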
