Stealth edits to large language models

Oliver J. Sutton; Qinghua Zhou; Wei Wang; Desmond J. Higham; Alexander N. Gorban; Alexander Bastounis; Ivan Y. Tyukin

arXiv:2406.12670 · cs.AI · October 31, 2024 · 1 citation

TL;DR

This paper introduces a theoretical framework and new methods for editing large language models without retraining. It shows that the same mechanisms leave models open to malicious stealth attacks, and that edits can be targeted at specific known hallucinating prompts without affecting other model behaviour.

Contribution

The paper identifies a single metric (a measure of the intrinsic dimension of the model's features) for assessing model editability, introduces a new jet-pack network block for precise editing, and demonstrates the susceptibility of models to stealth attacks.

Findings

01 A single metric predicts model editability and attack susceptibility.

02 New editing methods can modify models without retraining.

03 Models are vulnerable to simple, stealthy weight changes.

Abstract

We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for…
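The link the abstract draws between feature dimension and selective editing can be illustrated with a toy experiment. The sketch below is not the paper's method or its metric: it assumes unit-norm feature vectors drawn uniformly at random, a single hypothetical detector neuron with a fixed threshold `theta`, and made-up function names. It only gives an intuition for why a detector tuned to one feature vector is less likely to fire on unrelated inputs when the features occupy a higher-dimensional space.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def detector_activation(x, trigger, theta=0.5):
    """Hypothetical detector neuron: ReLU of a thresholded inner product.

    Scaled so that the activation is 1 when x equals the (unit-norm) trigger
    and 0 whenever the inner product <x, trigger> is at most theta.
    """
    dot = sum(a * b for a, b in zip(x, trigger))
    return max(0.0, (dot - theta) / (1.0 - theta))


def false_trigger_rate(dim, n_samples=2000, theta=0.5, seed=0):
    """Fraction of random unit-norm inputs that activate the detector."""
    rng = random.Random(seed)
    trigger = random_unit_vector(dim, rng)
    hits = sum(
        detector_activation(random_unit_vector(dim, rng), trigger, theta) > 0.0
        for _ in range(n_samples)
    )
    return hits / n_samples


if __name__ == "__main__":
    # Compare how often unrelated inputs fire the detector as dimension grows.
    for dim in (2, 8, 32, 256):
        print(dim, false_trigger_rate(dim))
```

In this toy model the detector responds fully to its trigger, while the rate of accidental activations on random inputs falls as the dimension increases. Real model features are not uniformly distributed, which is presumably why a measure of intrinsic dimension, rather than the raw embedding width, is the relevant quantity.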

Code & Models

Repositories

qinghua-zhou/stealth-edits (PyTorch, official)

Videos

Stealth edits to large language models · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques