When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Gautam Siddharth Kashyap; Mark Dras; Usman Naseem

arXiv:2602.07381 · cs.CL · February 10, 2026

Open Access

TL;DR

AlignX is a two-stage framework that aligns large language models with human values by reducing interference between conflicting objectives and correcting miscalibrated expert routing, improving helpfulness, harmlessness, and honesty.

Contribution

The paper introduces AlignX, which combines prompt-injected fine-tuning with expert routing calibrated using fractal and natural geometry to improve multi-objective alignment in LLMs and counter Axis Collapse.

Findings

01. Significant improvements on Alpaca, BeaverTails, and TruthfulQA benchmarks.

02. Over 35% reduction in latency and memory usage compared to prior MoE methods.

03. Validated generalizability across four different LLM architectures.

Abstract

Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these approaches struggle in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX…
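The abstract names miscalibrated routing as one half of Axis Collapse. As a rough illustration only, the sketch below shows generic softmax routing over three hypothetical axis experts, with a temperature parameter standing in as a simple calibration knob. It is not the MoCaE module: the abstract does not specify how the fractal and natural geometry calibration works, and the names `AXES`, `softmax`, and `route`, along with all logits and scores, are assumptions made for this example.

```python
import math

# Hypothetical expert names, one per alignment axis (assumption, not from the paper).
AXES = ("helpful", "harmless", "honest")


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a temperature above 1 flattens the weights."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, expert_scores, temperature=1.0):
    """Blend per-expert scores using softmax routing weights.

    Generic MoE-style gating, used here as a stand-in; the paper's
    calibration method is not described in the abstract.
    """
    weights = softmax(router_logits, temperature)
    blended = sum(w * s for w, s in zip(weights, expert_scores))
    return weights, blended


if __name__ == "__main__":
    logits = [4.0, 1.0, 0.5]   # hypothetical router logits, overconfident in "helpful"
    scores = [0.9, 0.2, 0.6]   # hypothetical per-expert output scores
    for temperature in (1.0, 4.0):
        weights, blended = route(logits, scores, temperature)
        rounded = {axis: round(w, 3) for axis, w in zip(AXES, weights)}
        print(temperature, rounded, round(blended, 3))
```

Raising the temperature spreads weight away from the dominant expert, which is one common way to soften an overconfident router; whether AlignX does anything comparable cannot be determined from the abstract alone.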

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI