Acceptance Cards: A Four-Diagnostic Standard for Safe Fine-Tuning Defense Claims
Phongsakon Mark Konrad, Toygar Tanyel, Serkan Ayvaz

TL;DR
This paper introduces Acceptance Cards, an evaluation protocol for safe fine-tuning defense claims. It checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before a held-out gap reduction is counted as a full-card pass.
Contribution
It proposes a standardized, executable audit framework that rigorously verifies the validity of fine-tuning defense claims beyond simple gap reduction metrics.
Findings
SafeLoRA fails the Acceptance Card diagnostics on the Gemma-2-2B-it model.
Most model families do not satisfy all strict diagnostic criteria.
The protocol reveals limitations in current fine-tuning defense evaluations.
Abstract
Safe fine-tuning defenses are often endorsed on the basis of a held-out gap reduction, but the same reduction can come from sampling noise, subject artifacts, capability loss, or a mechanism that does not transfer. We introduce Acceptance Cards: an evaluation protocol, a documentation object, an executable audit package, and a claim-specific evidential standard for safe fine-tuning defense claims. The protocol checks statistical reliability, fresh semantic generalization, mechanism alignment, and cross-task transfer before treating a gap reduction as a full-card pass. Re-scored under this installed-gap protocol, SafeLoRA fails the full-card pass on Gemma-2-2B-it: under strict mechanism-class coding it fails all four diagnostics, and under a permissive shrinkage relabel it still fails three of four. This is a narrow installed-gap audit on one model family, not a global judgment of…
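The pass rule described in the abstract can be sketched as a small data structure: a gap reduction counts as a full-card pass only if all four diagnostics pass. This is a minimal illustration, not the authors' audit package. The class name `AcceptanceCard`, its field names, and the `example_defense` values are hypothetical; only the four diagnostic categories and the SafeLoRA strict-coding outcome (all four fail on Gemma-2-2B-it) come from the abstract.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class AcceptanceCard:
    """Hypothetical container for the four diagnostics named in the abstract."""
    statistical_reliability: bool
    fresh_semantic_generalization: bool
    mechanism_alignment: bool
    cross_task_transfer: bool

    def failed(self) -> list[str]:
        # Names of the diagnostics that did not pass.
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def full_card_pass(self) -> bool:
        # A full-card pass requires every diagnostic to pass.
        return not self.failed()


# From the abstract: under strict mechanism-class coding,
# SafeLoRA fails all four diagnostics on Gemma-2-2B-it.
safelora_strict = AcceptanceCard(False, False, False, False)

# Hypothetical defense (invented values, for illustration only):
# passing three of four diagnostics is still not a full-card pass.
example_defense = AcceptanceCard(True, True, True, False)

print(safelora_strict.full_card_pass(), len(safelora_strict.failed()))
print(example_defense.full_card_pass(), example_defense.failed())
```

Treating the card as a conjunction of named booleans keeps the failing diagnostics visible, so a report can say which checks failed instead of collapsing the result into a single score.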
