AdversariaL attacK sAfety aLIgnment(ALKALI): Safeguarding LLMs through GRACE: Geometric Representation-Aware Contrastive Enhancement- Introducing Adversarial Vulnerability Quality Index (AVQI)

Danush Khanna; Gurucharan Marthi Krishna Kumar; Basab Ghosh; Yaswanth Narsupalli; Vinija Jain; Vasu Sharma; Aman Chadha; Amitava Das

arXiv:2506.08885 · cs.CL · September 30, 2025

TL;DR

This paper exposes a geometric blind spot in LLM safety, introduces ALKALI as a comprehensive adversarial benchmark, and proposes GRACE to improve alignment by reshaping internal representations, significantly reducing attack success.

Contribution

It introduces ALKALI, the first extensive adversarial benchmark for LLMs, and proposes GRACE, a novel alignment framework that mitigates latent camouflage vulnerabilities without altering the base model.

Findings

01. High attack success rates across models expose latent camouflage vulnerability.

02. GRACE reduces attack success rates by up to 39%.

03. AVQI effectively quantifies internal safety encoding failures.

Abstract

Adversarial threats against LLMs are escalating faster than current defenses can adapt. We expose a critical geometric blind spot in alignment: adversarial prompts exploit latent camouflage, embedding perilously close to the safe representation manifold while encoding unsafe intent, thereby evading surface-level defenses like Direct Preference Optimization (DPO), which remain blind to the latent geometry. We introduce ALKALI, the first rigorously curated adversarial benchmark and the most comprehensive to date, spanning 9,000 prompts across three macro categories, six subtypes, and fifteen attack families. Evaluation of 21 leading LLMs reveals alarmingly high Attack Success Rates (ASRs) across both open- and closed-source models, exposing an underlying vulnerability we term latent camouflage, a structural blind spot where adversarial completions mimic the latent geometry of safe ones. To…
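
As an illustration of the Attack Success Rate (ASR) metric named in the abstract, here is a minimal sketch. It assumes ASR is the fraction of adversarial prompts whose completion is judged unsafe. The function name `attack_success_rate` and the list of boolean judgments are hypothetical and are not taken from the paper. How completions are judged is not specified in this excerpt, so that step is left outside the sketch.

```python
from typing import Sequence


def attack_success_rate(judged_unsafe: Sequence[bool]) -> float:
    """Fraction of adversarial prompts whose completion was judged unsafe.

    Assumption: each entry is True when the attack succeeded, i.e. the
    model produced an unsafe completion. The judging procedure itself is
    not part of this sketch.
    """
    if not judged_unsafe:
        raise ValueError("need at least one judged prompt")
    return sum(judged_unsafe) / len(judged_unsafe)


# Hypothetical judgments for five adversarial prompts (illustrative only,
# not data from the paper).
judgments = [True, False, True, True, False]
asr = attack_success_rate(judgments)
print(f"ASR = {asr:.2f}")  # prints "ASR = 0.60"
```

Computing the same quantity before and after applying a defense would give the kind of comparison summarized in the findings, although the exact evaluation protocol used by the authors may differ.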

Taxonomy

Methods: Softmax · Attention Is All You Need · Balanced Selection · Attentive Walk-Aggregating Graph Neural Network