Stealth edits to large language models
Oliver J. Sutton, Qinghua Zhou, Wei Wang, Desmond J. Higham, Alexander N. Gorban, Alexander Bastounis, Ivan Y. Tyukin

TL;DR
This paper introduces a theoretical framework and new methods for editing large language models stealthily without retraining, revealing their vulnerability to malicious attacks and proposing solutions for targeted, selective modifications.
Contribution
The paper develops a metric for assessing model editability (a measure of the intrinsic dimension of the model's features), introduces a new jet-pack network block for precise editing, and demonstrates the susceptibility of models to stealth attacks.
Findings
A single metric, a measure of the intrinsic dimension of the model's features, predicts both model editability and attack susceptibility.
New editing methods can modify models without retraining.
Models are vulnerable to simple, stealthy weight changes.
Abstract
We reveal the theoretical foundations of techniques for editing large language models, and present new methods which can do so without requiring retraining. Our theoretical insights show that a single metric (a measure of the intrinsic dimension of the model's features) can be used to assess a model's editability and reveals its previously unrecognised susceptibility to malicious stealth attacks. This metric is fundamental to predicting the success of a variety of editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these as stealth editing methods, because they directly update a model's weights to specify its response to specific known hallucinating prompts without affecting other model behaviour. By carefully applying our theoretical insights, we are able to introduce a new jet-pack network block which is optimised for…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
