TL;DR
This paper introduces a mechanistic interpretability-based method to localize and understand vulnerabilities in large language models, demonstrated on GPT-2 for acronym prediction, to improve robustness against adversarial attacks.
Contribution
The paper presents a novel approach combining mechanistic interpretability techniques to identify and analyze specific vulnerabilities in LLMs for targeted tasks.
Findings
Successfully localized vulnerabilities in GPT-2 for acronym prediction
Generated adversarial samples to test model weaknesses
Provided insights into model failure modes and potential robustness improvements
Abstract
Large Language Models (LLMs), characterized by being trained on broad data in a self-supervised manner, have shown impressive performance across a wide range of tasks. Indeed, their generative abilities have aroused interest in applying LLMs across a wide range of contexts. However, neural networks in general, and LLMs in particular, are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the model's output. This is a serious concern that impedes the use of LLMs in high-stakes applications, such as healthcare, where a wrong prediction can have serious consequences. Even though there have been many efforts to make LLMs more robust to adversarial attacks, almost no works study how and where the vulnerabilities that make LLMs prone to adversarial attacks arise. Motivated by these facts,…
Taxonomy
Methods
Attention Is All You Need · Cosine Annealing · Softmax · Dense Connections · Dropout · Linear Layer · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Weight Decay
