Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes
Divyanshu Kumar, Anurakt Kumar, Sahil Agarwal, Prashanth Harshangi

TL;DR
This paper examines how fine-tuning and quantization affect the safety vulnerabilities of large language models, revealing that fine-tuning often increases jailbreak success, while guardrails improve safety.
Contribution
It provides a comprehensive analysis of the impact of fine-tuning and quantization on LLM safety, offering insights for developing more robust safety measures.
Findings
Fine-tuning generally increases jailbreak attack success rates.
Quantization has mixed effects on model vulnerability.
Implementing guardrails significantly improves resistance to jailbreaks.
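The findings above are stated in terms of jailbreak attack success rate. As a minimal sketch of how such a rate might be tallied per model variant, assuming each attempt has already been judged a success or failure: the variant names and outcome lists below are hypothetical placeholders, not data or methodology from the paper.

```python
from typing import Dict, List


def attack_success_rate(outcomes: List[bool]) -> float:
    """Return the fraction of jailbreak attempts judged successful."""
    if not outcomes:
        raise ValueError("no attempts recorded")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(outcomes) / len(outcomes)


# Hypothetical placeholder outcomes (True = the jailbreak succeeded).
# These are illustrative values only, not results from the paper.
results: Dict[str, List[bool]] = {
    "base": [False, False, True, False],
    "fine_tuned": [True, False, True, True],
    "fine_tuned_with_guardrail": [False, False, False, False],
}

asr_by_variant = {name: attack_success_rate(o) for name, o in results.items()}

for name, asr in asr_by_variant.items():
    print(f"{name}: {asr:.0%}")
```

Comparing the rate for each variant against the base model is one simple way to express whether a modification made a model more or less vulnerable.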
Abstract
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. However, these models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. These vulnerabilities can lead to the generation of malicious content, unauthorized actions, or the disclosure of confidential information. While foundational LLMs undergo alignment training and incorporate safety measures, they are often subsequently fine-tuned, or quantized for use in resource-constrained environments. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems. We evaluate foundational models including Mistral, Llama series, Qwen, and MosaicML, along with their fine-tuned variants. Our comprehensive analysis reveals that…
Taxonomy
Topics: Cryptography and Data Security · Security and Verification in Computing
Methods: LLaMA
