When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem

TL;DR
AlignX is a novel two-stage framework that improves the alignment of large language models with human values by reducing interference between conflicting objectives and calibrating expert routing, leading to better helpfulness, harmlessness, and honesty.
Contribution
The paper introduces AlignX, combining prompt-injected fine-tuning and fractal geometry-based expert routing to enhance multi-objective alignment in LLMs, overcoming Axis Collapse.
Findings
Significant improvements on Alpaca, BeaverTails, and TruthfulQA benchmarks.
Over 35% reduction in latency and memory usage compared to prior MoE methods.
Validated generalizability across four different LLM architectures.
Abstract
Aligning Large Language Models (LLMs) with human values, that is, being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these approaches struggle in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX…
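To make the routing idea concrete, the following is a minimal sketch of temperature-scaled softmax gating over three axis-specific experts. It is not the paper's MoCaE module, and it does not use fractal or natural geometry; the names `AXES`, `softmax`, and `route`, the temperature parameter, and the toy expert outputs are all assumptions introduced here for illustration. It only shows how a calibration knob on the router changes how strongly one expert dominates the blended output.

```python
import math

# Hypothetical expert names, one per alignment axis (an assumption, not from the paper).
AXES = ("helpful", "harmless", "honest")


def softmax(scores, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the weights."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, expert_outputs, temperature=1.0):
    """Blend per-expert output vectors using the gating weights.

    router_scores: one raw score per expert.
    expert_outputs: one equal-length vector per expert.
    Returns (weights, blended_vector).
    """
    if len(router_scores) != len(expert_outputs):
        raise ValueError("need exactly one score per expert")
    weights = softmax(router_scores, temperature)
    dim = len(expert_outputs[0])
    blended = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return weights, blended


if __name__ == "__main__":
    scores = [2.0, 0.0, 0.0]  # toy router scores favouring the first expert
    outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # toy expert outputs
    for t in (1.0, 5.0):
        weights, blended = route(scores, outputs, temperature=t)
        print(t, dict(zip(AXES, weights)), blended)
```

In this sketch an overconfident router corresponds to a low temperature, where a single expert receives almost all of the weight; raising the temperature spreads weight across experts. How AlignX actually calibrates routing is described in the paper itself, not here.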
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
