Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations

Abhilekh Borah; Chhavi Sharma; Danush Khanna; Utkarsh Bhatt; Gurpreet Singh; Hasnat Md Abdullah; Raghav Kaushik Ravi; Vinija Jain; Jyoti Patel; Shubham Singh; Vasu Sharma; Arpita Vats; Rahul Raja; Aman Chadha; Amitava Das

arXiv:2506.13901 · cs.CL · June 18, 2025

TL;DR

The paper introduces the Alignment Quality Index (AQI), a geometric latent space metric that detects hidden misalignments and jailbreak risks in large language models, surpassing traditional behavioral evaluations.

Contribution

It presents AQI as a novel, decoding-invariant alignment diagnostic based on clustering measures in latent space, and introduces the LITMUS dataset for robust evaluation.

Findings

01. AQI correlates with external safety judgments.

02. AQI detects vulnerabilities missed by refusal metrics.

03. Empirical validation on the LITMUS dataset supports AQI's effectiveness.

Abstract

Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the Alignment Quality Index (AQI). This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various…
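To make the idea of scoring safe/unsafe separation in latent space concrete, here is a minimal sketch of two of the cluster-validity indices the abstract names, the Calinski-Harabasz Index and a crisp (hard-assignment) form of the Xie-Beni Index. It is an illustration under stated assumptions, not the paper's implementation: the abstract is truncated, so how AQI combines, weights, or pools these measures is not reproduced here. The Gaussian points are a hypothetical stand-in for model activations, and all function names are invented for this example.

```python
import random


def centroid(points):
    """Mean vector of a list of equal-length points."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_scatter(clusters, cents):
    """Total squared distance of every point to its own cluster centroid."""
    return sum(sq_dist(p, m) for c, m in zip(clusters, cents) for p in c)


def calinski_harabasz(clusters):
    """CHI: between- over within-cluster dispersion (higher = better separated)."""
    n = sum(len(c) for c in clusters)
    k = len(clusters)
    cents = [centroid(c) for c in clusters]
    overall = centroid([p for c in clusters for p in c])
    between = sum(len(c) * sq_dist(m, overall) for c, m in zip(clusters, cents))
    return (between / (k - 1)) / (within_scatter(clusters, cents) / (n - k))


def xie_beni(clusters):
    """Crisp XBI: compactness over minimum centroid separation (lower = better)."""
    n = sum(len(c) for c in clusters)
    cents = [centroid(c) for c in clusters]
    min_sep = min(
        sq_dist(cents[i], cents[j])
        for i in range(len(cents))
        for j in range(i + 1, len(cents))
    )
    return within_scatter(clusters, cents) / (n * min_sep)


def fake_activations(center, spread, count, rng):
    """Hypothetical stand-in for activations: Gaussian points around a center."""
    return [[rng.gauss(c, spread) for c in center] for _ in range(count)]


# Assumption: cluster 0 plays the role of "safe" and cluster 1 of "unsafe".
rng = random.Random(0)
separated = [fake_activations([0.0] * 4, 0.5, 50, rng),
             fake_activations([5.0] * 4, 0.5, 50, rng)]
overlapping = [fake_activations([0.0] * 4, 1.0, 50, rng),
               fake_activations([0.5] * 4, 1.0, 50, rng)]

for name, clusters in [("separated", separated), ("overlapping", overlapping)]:
    print(name, "CHI:", calinski_harabasz(clusters), "XBI:", xie_beni(clusters))
```

On this synthetic data the well-separated pair scores a higher CHI and a lower XBI than the overlapping pair, which is the direction a geometric separation metric would reward. Because the two indices move in opposite directions, any combined score would need to invert or normalize one of them first; the choice of normalization is left open here.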

Taxonomy

Topics: Semantic Web and Ontologies · Data Mining Algorithms and Applications · Advanced Clustering Algorithms Research

Methods: Direct Preference Optimization