Beyond Creed: A Non-Identity Safety Condition A Strong Empirical Alternative to Identity Framing in Low-Data LoRA Fine-Tuning

Xinran Zhang

arXiv:2603.14723·cs.CL·March 17, 2026

TL;DR

This paper reports that a matched non-identity safety supervision format (D) outperforms creed-style identity framing in low-data LoRA fine-tuning across three instruction-tuned model families, challenging the necessity of explicit identity language for safety improvements.

Contribution

It introduces a non-identity safety supervision condition that surpasses creed-style identity framing in effectiveness, providing an empirical alternative to identity-based safety methods.

Findings

01. Non-identity condition D achieves the highest refusal rates on all three model families.

02. Creed-style framing (B) improves over constitutional rules (A) but remains less effective than D.

03. No significant trade-offs appear in capability evaluations across the supervision formats.

Abstract

How safety supervision is written may matter more than the explicit identity content it contains. We study low-data LoRA safety fine-tuning with four supervision formats built from the same core safety rules: constitutional rules (A), creed-style identity framing (B), a B-matched creed condition with a worldview/confession identity-maintenance tail (C), and a matched non-identity condition (D). Across three instruction-tuned model families (Llama 3.1 8B, Qwen2.5 7B, and Gemma 3 4B), we evaluate HarmBench using a reconciled dual-judge pipeline combining Bedrock-hosted DeepSeek v3.2 and Sonnet 4.6, with disagreement and boundary cases manually resolved. The non-identity condition D is the strongest group on all three model families on the full 320-behavior HarmBench set, reaching 74.4% refusal on Llama, 76.9% on Gemma, and 74.1% on Qwen. By comparison, creed-style framing (B) improves…
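The abstract describes a reconciled dual-judge pipeline in which disagreements and boundary cases are resolved by hand. The sketch below is one minimal way such a reconciliation and the resulting refusal rate could be computed. It is an illustration under assumptions, not the paper's code: the `Verdict` record, the `"refusal"`/`"compliance"` label strings, and the function names are all hypothetical, and the toy verdicts are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    """One HarmBench behavior scored by two judges (hypothetical record layout)."""
    behavior_id: str
    judge_a: str                  # assumed labels: "refusal" or "compliance"
    judge_b: str
    manual: Optional[str] = None  # human label for disagreement or boundary cases


def reconcile(v: Verdict) -> Optional[str]:
    """Return the final label, or None if the case still needs manual review."""
    if v.manual is not None:
        # A human decision overrides both judges (covers boundary cases too).
        return v.manual
    if v.judge_a == v.judge_b:
        return v.judge_a
    return None  # judges disagree and nobody has resolved it yet


def refusal_rate(verdicts: List[Verdict]) -> float:
    """Fraction of behaviors whose reconciled label is 'refusal'."""
    if not verdicts:
        raise ValueError("no verdicts to score")
    labels = [reconcile(v) for v in verdicts]
    unresolved = [v.behavior_id for v, lab in zip(verdicts, labels) if lab is None]
    if unresolved:
        raise ValueError(f"unresolved cases need manual review: {unresolved}")
    return sum(lab == "refusal" for lab in labels) / len(labels)


# Toy data, invented for illustration only.
toy = [
    Verdict("b1", "refusal", "refusal"),
    Verdict("b2", "refusal", "compliance", manual="refusal"),
    Verdict("b3", "compliance", "compliance"),
    Verdict("b4", "refusal", "refusal"),
]
print(f"refusal rate: {refusal_rate(toy):.1%}")
```

Raising on unresolved cases, rather than silently dropping them, keeps the denominator fixed at the full behavior set, which matters when rates are compared across conditions and model families.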

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)