Combining Domain and Alignment Vectors to Achieve Better Knowledge-Safety Trade-offs in LLMs

Megh Thakkar; Quentin Fournier; Matthew Riemer; Pin-Yu Chen; Amal Zouaq; Payel Das; Sarath Chandar

arXiv:2411.06824 · cs.AI · June 2, 2025

Open Access

TL;DR

This paper introduces MergeAlign, a merging-based method to combine domain expertise and safety alignment in LLMs, improving safety without sacrificing domain-specific performance.

Contribution

The paper presents MergeAlign, a merging technique that interpolates domain and alignment vectors to make domain-specific LLMs safer while preserving their utility on domain tasks.

Findings

01 MergeAlign improves safety alignment in Llama3 models specialized in medicine and finance.

02 Merging causes minimal to no degradation on domain-specific benchmarks.

03 Model similarity metrics help explain the effectiveness of merging.

Abstract

There is a growing interest in training domain-expert LLMs that excel in specific technical fields compared to their general-purpose instruction-tuned counterparts. However, these expert models often experience a loss in their safety abilities in the process, making them capable of generating harmful content. As a solution, we introduce an efficient and effective merging-based alignment method called MergeAlign that interpolates the domain and alignment vectors, creating safer domain-specific models while preserving their utility. We apply MergeAlign on Llama3 variants that are experts in medicine and finance, obtaining substantial alignment improvements with minimal to no degradation on domain-specific benchmarks. We study the impact of model merging through model similarity metrics and contributions of individual models being merged. We hope our findings open new…
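The interpolation of domain and alignment vectors described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption: each "vector" is taken to be the parameter difference between a fine-tuned model and a shared base model, and the merge is a simple weighted sum controlled by a hypothetical coefficient `alpha`. The names `task_vector`, `merge_interpolate`, and the toy weights are illustrative only and are not taken from the paper.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat weights.
Weights = Dict[str, List[float]]


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Per-parameter difference between a fine-tuned model and its base."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge_interpolate(
    base: Weights, domain: Weights, aligned: Weights, alpha: float = 0.5
) -> Weights:
    """Add an interpolation of the domain and alignment vectors to the base.

    Assumed form (not confirmed by the source):
        merged = base + alpha * domain_vec + (1 - alpha) * align_vec
    """
    domain_vec = task_vector(domain, base)
    align_vec = task_vector(aligned, base)
    return {
        name: [
            b + alpha * d + (1.0 - alpha) * a
            for b, d, a in zip(base[name], domain_vec[name], align_vec[name])
        ]
        for name in base
    }


# Toy example with a single two-element parameter.
base = {"w": [1.0, 2.0]}
domain = {"w": [2.0, 2.0]}   # pretend domain-expert fine-tune
aligned = {"w": [1.0, 4.0]}  # pretend safety-aligned fine-tune

merged = merge_interpolate(base, domain, aligned, alpha=0.5)
print(merged)  # {'w': [1.5, 3.0]}
```

With `alpha=1.0` this sketch recovers the domain model exactly, and with `alpha=0.0` it recovers the aligned model, which is one way to picture the knowledge-safety trade-off the title refers to.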

Taxonomy

Topics: Digital Rights Management and Security