Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
Samuele Poppi, Zheng-Xin Yong, Yifei He, Bobbie Chern, Han Zhao, Aobo Yang, Jianfeng Chi

TL;DR
This paper investigates the vulnerability of multilingual large language models to fine-tuning attacks, revealing their cross-lingual fragility and proposing a method to identify and analyze safety-related information in model parameters.
Contribution
It introduces the Safety Information Localization (SIL) method to identify safety-related parameters and demonstrates the cross-lingual transferability of fine-tuning attacks on multilingual LLMs.
Findings
Fine-tuning attacks can compromise multilingual LLMs across languages.
Changing 20% of parameters can break safety alignment.
Freezing safety-related parameters does not prevent attacks.
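To make the "20% of parameters" finding concrete, here is a toy sketch of what a sparse weight change can look like. It is not the paper's SIL procedure, which this summary does not describe. The function name `sparsify_update`, the magnitude-based selection rule, and the toy weight values are all assumptions made for illustration.

```python
# Hypothetical illustration only: NOT the paper's SIL method.
# Assumption: we rank weight changes by absolute magnitude and keep the top fraction.

def sparsify_update(aligned, attacked, keep_fraction=0.2):
    """Apply only the largest-magnitude fraction of the weight changes."""
    if len(aligned) != len(attacked):
        raise ValueError("weight vectors must have the same length")
    # Per-parameter change introduced by fine-tuning.
    deltas = [b - a for a, b in zip(aligned, attacked)]
    n_keep = int(round(keep_fraction * len(deltas)))
    # Indices sorted by how much each parameter moved, largest first.
    ranked = sorted(range(len(deltas)), key=lambda i: abs(deltas[i]), reverse=True)
    kept = set(ranked[:n_keep])
    # Kept indices receive their change; all other weights stay at the aligned value.
    merged = [a + deltas[i] if i in kept else a for i, a in enumerate(aligned)]
    return merged, sorted(kept)


# Toy weights (made up): ten parameters before and after a fine-tuning run.
aligned = [0.10, -0.20, 0.30, 0.40, -0.50, 0.60, 0.70, -0.80, 0.90, 1.00]
attacked = [0.11, -0.20, 0.95, 0.40, -0.52, 0.60, 0.70, -0.10, 0.90, 1.01]

merged, changed = sparsify_update(aligned, attacked, keep_fraction=0.2)
print("indices changed:", changed)
```

In this toy setup, only two of the ten weights differ from the aligned model after merging. Whether such a sparse change is enough to break safety alignment in a real model is the empirical question the paper addresses.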
Abstract
Recent advancements in Large Language Models (LLMs) have sparked widespread concerns about their safety. Recent work demonstrates that safety alignment of LLMs can be easily removed by fine-tuning with a few adversarially chosen instruction-following examples, i.e., fine-tuning attacks. We take a further step to understand fine-tuning attacks in multilingual LLMs. We first discover cross-lingual generalization of fine-tuning attacks: using a few adversarially chosen instruction-following examples in one language, multilingual LLMs can also be easily compromised (e.g., multilingual LLMs fail to refuse harmful prompts in other languages). Motivated by this finding, we hypothesize that safety-related information is language-agnostic and propose a new method termed Safety Information Localization (SIL) to identify the safety-related information in the model parameter space. Through SIL, we…
Taxonomy
Topics: Network Security and Intrusion Detection · Advanced Malware Detection Techniques · Web Application Security Vulnerabilities
