Why do universal adversarial attacks work on large language models?: Geometry might be the answer
Varshini Subhash, Anna Bialas, Weiwei Pan, Finale Doshi-Velez

TL;DR
This paper offers a geometric explanation for why universal adversarial attacks succeed on large language models: the triggers may be embedding vectors that merely approximate the semantic information in their adversarial training region. Understanding this geometry could inform mitigation strategies.
Contribution
The paper proposes a novel geometric explanation for universal adversarial attacks on LLMs, supported by white-box analysis of hidden representations in the 117M-parameter GPT-2 model.
Findings
Universal adversarial triggers may be embedding vectors that merely approximate the semantic information in their adversarial training region
White-box analysis of hidden representations, using dimensionality reduction and similarity measurement, supports this hypothesis
Understanding geometry could help mitigate adversarial vulnerabilities
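The kind of analysis named in the findings can be illustrated with a minimal sketch. It is not the authors' code: the summary only says "dimensionality reduction and similarity measurement", so the use of PCA and cosine similarity here is an assumption, and every vector is synthetic random data standing in for real GPT-2 hidden states. The names `region`, `unrelated`, and `trigger` are hypothetical; only the hidden size of 768 comes from the 117M-parameter GPT-2 model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pca_project(X, k=2):
    """Project the rows of X onto their top-k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0, keepdims=True)  # center each feature
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


rng = np.random.default_rng(0)
dim = 768  # hidden size of the 117M-parameter GPT-2 model

# SYNTHETIC stand-ins (assumption): a tight cluster playing the role of the
# adversarial training region, a second set of unrelated vectors, and a
# "trigger" vector placed near the center of the first cluster.
region_center = rng.normal(size=dim)
region = region_center + 0.1 * rng.normal(size=(50, dim))
unrelated = rng.normal(size=(50, dim))
trigger = region_center + 0.1 * rng.normal(size=dim)

# Similarity measurement: compare the trigger with each group's mean vector.
sim_to_region = cosine_similarity(trigger, region.mean(axis=0))
sim_to_unrelated = cosine_similarity(trigger, unrelated.mean(axis=0))

# Dimensionality reduction: 2-D coordinates for plotting all points together.
coords = pca_project(np.vstack([region, unrelated, trigger[None, :]]), k=2)

print(f"similarity to region mean:    {sim_to_region:.3f}")
print(f"similarity to unrelated mean: {sim_to_unrelated:.3f}")
print("projected shape:", coords.shape)
```

In a real analysis the synthetic arrays would be replaced by hidden representations extracted from the model. The sketch only shows the shape of the comparison: a trigger that approximates a region's semantics should sit closer to that region than to unrelated text.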
Abstract
Transformer-based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings, in the context of adversarial attacks, remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and potentially dangerous due to their input-agnostic nature. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden…
Taxonomy
Topics: Machine Learning in Materials Science · Topic Modeling · Ferroelectric and Negative Capacitance Devices
Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Dense Connections · Linear Layer · Byte Pair Encoding
