Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin

TL;DR
This paper investigates how fine-tuning large language models on specific datasets can unintentionally introduce vulnerabilities. It analyzes which dataset factors contribute and offers insights for improving adversarial robustness and model safety.
Contribution
It identifies key dataset characteristics that contribute to accidental vulnerabilities in fine-tuned models and explores causal relationships to improve adversarial defenses.
Findings
Linguistic features, semantic similarity, and toxicity influence vulnerability levels.
Dataset characteristics correlate with attack success rates.
Understanding dataset factors aids in developing better defense strategies.
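The kind of correlation described in the findings above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the function names `attack_success_rate` and `pearson`, the choice of Pearson correlation, and every number in the toy data are assumptions made for the example and are not results from the paper.

```python
from math import sqrt


def attack_success_rate(outcomes):
    """Return the fraction of attack attempts marked successful (truthy)."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    return sum(1 for o in outcomes if o) / len(outcomes)


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data (NOT from the paper): one entry per fine-tuning dataset.
# Each entry pairs an assumed mean toxicity score with made-up attack outcomes
# for the model fine-tuned on that dataset.
toy_datasets = {
    "dataset_a": (0.05, [False, False, True, False]),
    "dataset_b": (0.20, [True, False, True, False]),
    "dataset_c": (0.40, [True, True, True, False]),
}

toxicity = [score for score, _ in toy_datasets.values()]
asr = [attack_success_rate(outcomes) for _, outcomes in toy_datasets.values()]

print("ASR per dataset:", asr)
print("Pearson r (toxicity vs. ASR):", pearson(toxicity, asr))
```

With only a handful of datasets, a coefficient like this is descriptive at best; a correlation of this kind says nothing about causation on its own, which is why the abstract separately mentions exploring causal relationships.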
Abstract
As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the…
Peer Reviews
Decision · Submitted to ICLR 2026
1. Overall the paper is clear and the structure is logical. 2. The work employs a structured empirical way to investigate the problem. The use of standard, state-of-the-art adversarial attacks from HarmBench looks like a good evaluation framework. 3. The incorporation of causal mediation analysis is an ambitious and welcome step to move beyond simple correlation and attempt to causally link dataset features to model vulnerability.
The paper addresses an important problem: maintaining LLM safety after fine-tuning. However, I am not convinced by the overall contribution. 1. Low Novelty: The general idea that fine-tuning can degrade safety is known. The paper does not bring a significant originality of ideas or execution that advances the state of the art beyond confirming existing intuitions with a marginal effect size. 2. Weak Empirical Evidence for the Core Claim: The "Accidental Vulnerability" in domain-specific, non-h
- Studying the data: The paper aims to gain a deeper understanding of how fine-tuning data affects safety-related model behaviors. This is an important question, and, in particular, directly studying the data is often neglected in the literature. I consider it a strength of this paper that it pushes in this direction.
- Phenomena essentially identical or very similar to accidental vulnerability are widely known in the fine-tuning and safety literature, so that the claim of the authors that they are introducing a new concept (Section 6) seems unjustified. See e.g. (Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Betley et al., 2025), (Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks, Jain et al., 2023). - Lack of coherence and research structure
## originality The notion that finetuning on a given dataset (whether benign or malicious) can make a model less robust is not a new idea. However, studying it so directly, and asking which factors contribute more or less, is something that I'm seeing for the first time here. ## quality Overall the experiments are of good quality. ## clarity The explanatory tables are helpful in making clear exactly what is in the different datasets. Some of the plots are quite clear. ## significance If the p
## Flow, clarity, message My main challenge with this paper is its overall presentation and lack of clear message. The results individually are mostly (but not all) understandable. But starting from Section 3, and especially from Section 4, I am unsure what message the authors are trying to convey. I'll go into more details later in the "Questions" section of the review. ## Not referencing tables and plots Most of the tables and plots in the paper are not referenced anywhere in the main text. E
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Topic Modeling
