One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs
Yixin Tan, Zhe Yu, Jun Sakuma

TL;DR
This paper shows that finetuned large language models inherit jailbreak vulnerabilities from their pretrained sources. Adversarial prompts that transfer are linearly separable in the pretrained model's hidden states, so white-box access to a pretrained model poses a significant security risk to its finetuned derivatives.
Contribution
The paper reveals that finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources, and it introduces the Probe-Guided Projection (PGP) attack to exploit this transferability.
Findings
Adversarial prompts transfer effectively from pretrained to finetuned models.
Transferable prompts are linearly separable in hidden representations.
The proposed PGP attack achieves strong transfer success across models.
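The linear-separability finding above can be illustrated with a small linear probe. The following is a minimal sketch, not the paper's probing setup: it assumes synthetic vectors as a stand-in for hidden states, with "transferable" and "non-transferable" prompts modeled as two Gaussian clusters offset along one direction. All names here (`make_synthetic_states`, `train_linear_probe`, `probe_accuracy`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def make_synthetic_states(n, dim, shift, seed=0):
    """Hypothetical stand-in for hidden states (an assumption, not real model data).

    Label 1 ("transferable") and label 0 ("non-transferable") are Gaussian
    clusters pushed apart along a single unit direction.
    """
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim
    states, labels = [], []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label == 1 else -1.0
        states.append([rng.gauss(0.0, 1.0) + sign * shift * d for d in direction])
        labels.append(label)
    return states, labels


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(states, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(states[0])
    n = len(states)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, t in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, states, labels):
    """Fraction of states on the correct side of the probe's hyperplane."""
    correct = 0
    for x, t in zip(states, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == t)
    return correct / len(states)


# Train on one synthetic sample and evaluate on a separately seeded one.
train_x, train_y = make_synthetic_states(200, 8, shift=3.0, seed=0)
test_x, test_y = make_synthetic_states(100, 8, shift=3.0, seed=1)
w, b = train_linear_probe(train_x, train_y)
accuracy = probe_accuracy(w, b, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy from a purely linear classifier is what "linearly separable" means in practice. With real models, the synthetic vectors would be replaced by hidden states extracted from the pretrained LLM, and the labels by whether each prompt transferred.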
Abstract
Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection
