One Leak Away: How Pretrained Model Exposure Amplifies Jailbreak Risks in Finetuned LLMs

Yixin Tan; Zhe Yu; Jun Sakuma

arXiv:2512.14751 · cs.CR · December 18, 2025

TL;DR

This paper demonstrates that jailbreak vulnerabilities in pretrained large language models are inherited by their finetuned versions, with transferable adversarial prompts exploiting linear separability in hidden states, posing significant security risks.

Contribution

The paper shows that finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources, and it introduces the Probe-Guided Projection (PGP) attack to exploit this transferability.

Findings

01. Adversarial prompts transfer effectively from pretrained to finetuned models.

02. Transferable prompts are linearly separable in hidden representations.

03. The proposed PGP attack achieves strong transfer success across models.

Abstract

Finetuning pretrained large language models (LLMs) has become the standard paradigm for developing downstream applications. However, its security implications remain unclear, particularly regarding whether finetuned LLMs inherit jailbreak vulnerabilities from their pretrained sources. We investigate this question in a realistic pretrain-to-finetune threat model, where the attacker has white-box access to the pretrained LLM and only black-box access to its finetuned derivatives. Empirical analysis shows that adversarial prompts optimized on the pretrained model transfer most effectively to its finetuned variants, revealing inherited vulnerabilities from pretrained to finetuned LLMs. To further examine this inheritance, we conduct representation-level probing, which shows that transferable prompts are linearly separable within the pretrained hidden states, suggesting that universal…
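The abstract's claim of linear separability in hidden states can be illustrated with a linear probe, that is, a logistic-regression classifier trained directly on hidden-state vectors. The sketch below is not the paper's probing protocol or its PGP attack. It runs on synthetic Gaussian vectors that stand in for hidden states, and every name in it (`make_synthetic_hidden_states`, `train_linear_probe`, `probe_accuracy`, the dimension, and the shift size) is a hypothetical choice made for illustration. It only shows what "linearly separable" means operationally: a single weight vector and bias classify held-out points well.

```python
import numpy as np


def make_synthetic_hidden_states(n_per_class=200, dim=64, shift=4.0, seed=0):
    """Return synthetic stand-ins for hidden states (an assumption, not real model data).

    Class 1 ("transferable") is class 0 shifted along one random unit direction.
    """
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    negatives = rng.normal(size=(n_per_class, dim))
    positives = rng.normal(size=(n_per_class, dim)) + shift * direction
    X = np.vstack([negatives, positives])
    y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y


def train_linear_probe(X, y, lr=0.1, steps=500, l2=1e-3):
    """Fit logistic regression by full-batch gradient descent; return (w, b)."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (probs - y) / n + l2 * w)
        b -= lr * float(np.mean(probs - y))
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of points on the correct side of the hyperplane w.x + b = 0."""
    predictions = (X @ w + b > 0).astype(float)
    return float(np.mean(predictions == y))


if __name__ == "__main__":
    X, y = make_synthetic_hidden_states()
    # Shuffle, then hold out 25% of the points to measure generalization.
    order = np.random.default_rng(1).permutation(len(y))
    split = int(0.75 * len(y))
    train_idx, test_idx = order[:split], order[split:]
    w, b = train_linear_probe(X[train_idx], y[train_idx])
    print("held-out probe accuracy:", probe_accuracy(X[test_idx], y[test_idx], w, b))
```

On real data, `X` would hold hidden states extracted from a chosen layer of the pretrained model and `y` would mark which prompts transferred. High held-out accuracy from such a probe is the kind of evidence the abstract describes, although the layer choice and labeling procedure used by the authors are not given in this excerpt.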

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Hate Speech and Cyberbullying Detection