Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Kaustubh Dhole

arXiv:2512.23837·cs.CL·January 1, 2026

Open Access

TL;DR

This paper introduces a method for generating adversarial examples from the intermediate attention layers of language models, with the aim of stress-testing LLM evaluation pipelines.

Contribution

The paper creates adversarial examples directly from attention-layer token distributions, in contrast to prompt-based or gradient-based attacks.

Findings

01. Attention-based adversarial examples cause performance drops in evaluation tasks.

02. Substitutions from certain layers can introduce grammatical issues.

03. The method highlights both potential and limitations of internal model representations for adversarial testing.

Abstract

Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable…
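
The idea of reading token candidates off an intermediate layer can be illustrated with a small toy sketch. This is not the paper's implementation: the excerpt above does not say how the attention-layer distributions are obtained, so the sketch assumes a logit-lens-style projection of a hidden vector onto the vocabulary. The vocabulary, the `UNEMBED` matrix, the hidden vector, and the function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy vocabulary and unembedding matrix (one row per token).
VOCAB = ["good", "strong", "weak", "argument", "the"]
UNEMBED = [
    [1.0, 0.0, 0.0],   # good
    [0.8, 0.6, 0.0],   # strong
    [-1.0, 0.0, 0.0],  # weak
    [0.0, 0.0, 1.0],   # argument
    [0.0, 1.0, 0.0],   # the
]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_token_distribution(hidden):
    """Project one intermediate hidden vector onto the vocabulary (assumed logit-lens step)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in UNEMBED]
    return softmax(logits)

def candidate_substitutes(hidden, original, k=2):
    """Return the k most probable tokens at this layer, excluding the original token."""
    probs = layer_token_distribution(hidden)
    ranked = sorted(zip(VOCAB, probs), key=lambda pair: pair[1], reverse=True)
    return [tok for tok, _ in ranked if tok != original][:k]

def perturb(tokens, position, hidden):
    """Replace the token at `position` with the top intermediate-layer candidate."""
    subs = candidate_substitutes(hidden, tokens[position], k=1)
    out = list(tokens)
    if subs:
        out[position] = subs[0]
    return out

sentence = ["the", "good", "argument"]
# Hypothetical hidden state for position 1 at some intermediate layer.
hidden_at_layer = [0.9, 0.5, 0.1]
adversarial = perturb(sentence, 1, hidden_at_layer)
print(adversarial)
```

In a real setting the hidden vectors and the projection would come from the model under study, and the perturbed text would then be passed to the evaluator to see whether its score changes.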

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling