Why do universal adversarial attacks work on large language models?: Geometry might be the answer

Varshini Subhash; Anna Bialas; Weiwei Pan; Finale Doshi-Velez

arXiv:2309.00254 · cs.LG · September 4, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a geometric perspective to explain why universal adversarial attacks succeed on large language models, suggesting they approximate semantic information in embedding space, which could inform mitigation strategies.

Contribution

It proposes a novel geometric explanation for universal adversarial attacks on LLMs, supported by analysis of hidden representations in GPT-2.

Findings

01 Universal adversarial triggers may be embedding vectors that approximate the semantic information in their adversarial training region.

02 White-box analysis of hidden representations, using dimensionality reduction and similarity measurement, supports this hypothesis.

03 Understanding this geometry could help mitigate adversarial vulnerabilities.

Abstract

Transformer-based large language models with emergent capabilities are becoming increasingly ubiquitous in society. However, the task of understanding and interpreting their internal workings in the context of adversarial attacks remains largely unsolved. Gradient-based universal adversarial attacks have been shown to be highly effective on large language models and are potentially dangerous because they are input-agnostic. This work presents a novel geometric perspective explaining universal adversarial attacks on large language models. By attacking the 117M parameter GPT-2 model, we find evidence indicating that universal adversarial triggers could be embedding vectors which merely approximate the semantic information in their adversarial training region. This hypothesis is supported by white-box model analysis comprising dimensionality reduction and similarity measurement of hidden…
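The similarity measurement mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code or data: the 3-dimensional vectors below are hypothetical toy stand-ins for hidden representations, and the helper names `cosine_similarity` and `centroid` are assumptions introduced only for this example. The idea shown is that a vector which approximates a region of embedding space has high cosine similarity with that region's centroid, while an unrelated vector does not.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical toy data (assumption): three "hidden representations" of
# texts from one semantic region, one trigger-like vector that lies close
# to that region, and one unrelated vector.
region = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [1.1, 1.0, -0.1],
]
trigger = [1.0, 1.0, 0.0]
unrelated = [0.0, 0.0, 1.0]

region_center = centroid(region)
sim_trigger = cosine_similarity(trigger, region_center)
sim_unrelated = cosine_similarity(unrelated, region_center)

print(f"trigger vs. region centroid:   {sim_trigger:.4f}")
print(f"unrelated vs. region centroid: {sim_unrelated:.4f}")
```

In a real analysis the vectors would come from a model's hidden layers and would have hundreds of dimensions, so a dimensionality reduction step would typically precede any visual comparison; the toy vectors here only show the direction of the comparison.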

Taxonomy

Topics: Machine Learning in Materials Science · Topic Modeling · Ferroelectric and Negative Capacitance Devices

Methods: Attention Is All You Need · Cosine Annealing · Layer Normalization · Linear Warmup With Cosine Annealing · Attention Dropout · Softmax · Dense Connections · Linear Layer · Byte Pair Encoding