H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs

Selim Furkan Tekin; Fatih Ilhan; Tiansheng Huang; Sihao Hu; Yichang Xu; Zachary Yahn; Ling Liu

arXiv:2411.17792·cs.CL·January 22, 2026

TL;DR

H3Fusion introduces a mixture-of-experts fusion mechanism for aligned LLMs, improving helpfulness, harmlessness, and honesty by balancing alignment properties through a controllable subspace drift and dual objectives.

Contribution

The paper presents a novel MoE-based fusion method that models alignment as a controllable drift, enhancing the integration of helpful, harmless, and honest properties in LLMs.

Findings

1. Outperforms individually aligned models by 11.37%.

2. Achieves 13.77% stronger robustness than state-of-the-art ensemble methods.

3. Provides 6.18% improvement over model-merging approaches.

Abstract

The alignment of pre-trained LLMs continues to draw significant attention from both industry and academia, with the aim of ensuring responses that are helpful, harmless, and honest. However, identifying a point in the model's representation subspace that satisfies all three properties simultaneously remains challenging. H3Fusion addresses this challenge by introducing a mixture-of-experts (MoE)-based fusion mechanism that models alignment as a controllable drift within the subspace, guided by a drift-regularization loss that balances competing alignment dimensions. Furthermore, we formulate alignment as a dual objective that harnesses the distance between generated embeddings and alignment embeddings, and we introduce a gating loss that canalizes activations onto the contributing experts. Extensive evaluations on three benchmark datasets show that H3Fusion is more helpful, less harmful, and…
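The abstract names a gated fusion of aligned experts and a drift-regularization term but does not give their formulas. The following is a minimal sketch of the general idea only, not the authors' implementation: softmax gate weights combine the outputs of three hypothetical experts, and a squared-distance penalty stands in for drift regularization. The function names `softmax`, `fuse`, and `drift_penalty`, the `strength` parameter, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of gate scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_outputs, gate_scores):
    """Combine expert output vectors as a softmax-weighted sum.

    Assumption: one output vector per expert (e.g. helpful, harmless,
    honest), all of the same dimension.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights


def drift_penalty(fused, reference, strength):
    """Hypothetical drift regularizer: scaled squared L2 distance
    between the fused vector and a reference vector."""
    return strength * sum((f - r) ** 2 for f, r in zip(fused, reference))


# Toy example: three experts with 2-dimensional outputs and equal gate scores.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = fuse(experts, [0.0, 0.0, 0.0])
penalty = drift_penalty(fused, [0.0, 0.0], strength=0.5)
```

With equal gate scores each expert contributes one third of the fused vector. Raising one gate score shifts the fused output toward that expert, and a larger `strength` penalizes movement away from the reference more heavily, which is one plausible reading of "controllable drift".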

Code & Models

Repositories

sftekin/h3fusion (PyTorch, official)

Videos

H3Fusion: Helpful, Harmless, Honest Fusion of Aligned LLMs (Underline)

Taxonomy

Topics: Particle Accelerators and Free-Electron Lasers · Superconducting Materials and Applications · Particle accelerators and beam dynamics

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention