Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability

Jorge García-Carrasco; Alejandro Maté; Juan Trujillo

arXiv:2407.19842·cs.LG·July 30, 2024

TL;DR

This paper introduces a method based on mechanistic interpretability for localizing and understanding vulnerabilities in large language models, demonstrated on GPT-2 performing acronym prediction, with the aim of improving robustness against adversarial attacks.

Contribution

The paper presents a novel approach combining mechanistic interpretability techniques to identify and analyze specific vulnerabilities in LLMs for targeted tasks.

Findings

01 Successfully localized vulnerabilities in GPT-2 for acronym prediction

02 Generated adversarial samples to test model weaknesses

03 Provided insights into model failure modes and potential robustness improvements

Abstract

Large Language Models (LLMs), characterized by being trained on large amounts of data in a self-supervised manner, have shown impressive performance across a wide range of tasks. Indeed, their generative abilities have aroused interest in applying LLMs across a wide range of contexts. However, neural networks in general, and LLMs in particular, are known to be vulnerable to adversarial attacks, where an imperceptible change to the input can mislead the output of the model. This is a serious concern that impedes the use of LLMs in high-stakes applications, such as healthcare, where a wrong prediction can have serious consequences. Even though there are many efforts to make LLMs more robust to adversarial attacks, almost no works study how and where the vulnerabilities that make LLMs prone to adversarial attacks arise. Motivated by these facts,…

Code & Models

Repositories

jgcarrasco/detecting-vulnerabilities-mech-interp
jax · Official

Taxonomy

Methods

Attention Is All You Need · Cosine Annealing · Softmax · Dense Connections · Dropout · Linear Layer · Discriminative Fine-Tuning · Linear Warmup With Cosine Annealing · Weight Decay