H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs
Selim Furkan Tekin, Fatih Ilhan, Tiansheng Huang, Sihao Hu, Yichang Xu, Zachary Yahn, Ling Liu

TL;DR
H3Fusion introduces a mixture-of-experts fusion mechanism for aligned LLMs that improves helpfulness, harmlessness, and honesty by balancing these alignment properties through a controllable subspace drift and a dual objective.
Contribution
The paper presents a novel MoE-based fusion method that models alignment as a controllable drift, enhancing the integration of helpful, harmless, and honest properties in LLMs.
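To make "alignment as a controllable drift" concrete, here is a minimal sketch of a generic softmax-gated MoE fusion step in plain Python. It is not the authors' implementation. The function names, the per-expert hidden states, the router logits, and the `alpha` knob that scales how far the fused state drifts from the base model are all hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_experts(base, experts, gate_logits, alpha=1.0):
    """Hypothetical MoE fusion: move the base state toward gated experts.

    base:        hidden state of the shared pre-trained model (list of floats)
    experts:     one hidden state per aligned expert
                 (e.g. helpful, harmless, honest)
    gate_logits: router scores, one per expert (assumed router output)
    alpha:       assumed scalar that controls the size of the drift;
                 alpha=1 gives the plain gate-weighted average of experts
    Returns (fused state, drift vector, gate weights).
    """
    weights = softmax(gate_logits)
    dim = len(base)
    # Gate-weighted average of each expert's offset from the base state.
    drift = [
        alpha * sum(w * (e[i] - base[i]) for w, e in zip(weights, experts))
        for i in range(dim)
    ]
    fused = [b + d for b, d in zip(base, drift)]
    return fused, drift, weights


if __name__ == "__main__":
    base = [0.0, 0.0]
    experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused, drift, weights = fuse_experts(base, experts, [0.0, 0.0, 0.0])
    print(weights, fused)
    half, _, _ = fuse_experts(base, experts, [0.0, 0.0, 0.0], alpha=0.5)
    print(half)
```

With equal router logits each expert gets weight 1/3, so the fused state is the plain average of the experts; halving `alpha` halves the drift. Writing the drift as an explicit offset from the base state is what makes it straightforward to regularize or scale.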
Findings
Outperforms individually aligned models by 11.37%.
Achieves 13.77% stronger robustness than state-of-the-art ensemble methods.
Provides 6.18% improvement over model-merging approaches.
Abstract
The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, with the aim of ensuring responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that simultaneously satisfies all these properties remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss to balance competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings, and we introduce a gating loss that channels activations toward the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and…
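The abstract names three training signals: a drift-regularization loss, a distance-based objective between generated and alignment embeddings, and a gating loss. The sketch below shows one plausible way to combine such terms. It is not the paper's formulation. Squared L2 distance, the gate-weighted embedding term, gate entropy as the "gating loss", and the weights `lam_drift` and `lam_gate` are all assumptions made for illustration.

```python
import math


def squared_distance(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def drift_regularization(fused, base):
    # Assumed form: penalize how far the fused representation
    # moves away from the base model's representation.
    return squared_distance(fused, base)


def embedding_alignment_loss(generated, alignment_targets, weights):
    # Hypothetical stand-in for the distance objective: gate-weighted
    # distance from the generated embedding to each alignment embedding.
    return sum(
        w * squared_distance(generated, t)
        for w, t in zip(weights, alignment_targets)
    )


def gating_entropy(weights, eps=1e-12):
    # Entropy of the gate distribution; minimizing it concentrates
    # activation mass on fewer experts (assumed gating loss).
    return -sum(w * math.log(w + eps) for w in weights)


def total_loss(generated, base, targets, weights,
               lam_drift=0.1, lam_gate=0.01):
    # Weighted sum of the three assumed terms; the lambdas are
    # illustrative hyperparameters, not values from the paper.
    return (
        embedding_alignment_loss(generated, targets, weights)
        + lam_drift * drift_regularization(generated, base)
        + lam_gate * gating_entropy(weights)
    )


if __name__ == "__main__":
    loss = total_loss(
        generated=[1.0, 0.0],
        base=[0.0, 0.0],
        targets=[[1.0, 0.0], [0.0, 1.0]],
        weights=[0.5, 0.5],
    )
    print(loss)
```

Raising `lam_drift` keeps the fused model closer to its starting point, while raising `lam_gate` pushes the router toward committing to a few experts. That is one way to read the trade-off the abstract describes between balancing alignment dimensions and focusing on the contributing experts.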
