Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs
Megh Thakkar, Quentin Fournier, Matthew Riemer, Pin-Yu Chen, Amal Zouaq, Payel Das, Sarath Chandar

TL;DR
This paper introduces MergeAlign, a merging-based method to combine domain expertise and safety alignment in LLMs, improving safety without sacrificing domain-specific performance.
Contribution
The paper presents MergeAlign, a novel merging technique that enhances safety in domain-specific LLMs while maintaining their utility, addressing a key challenge in specialized model development.
Findings
MergeAlign improves safety alignment in Llama3 domain models.
Minimal performance loss on domain benchmarks after merging.
Model similarity metrics explain merging effectiveness.
Abstract
There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called \textsc{MergeAlign} that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply \textsc{MergeAlign} on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDigital Rights Management and Security
