Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models
Harry J. Davies, Giorgos Iacovides, Danilo P. Mandic

TL;DR
The paper introduces TARS, a novel method for removing specific knowledge from large language models by reversing internal concept vectors, achieving effective removal across languages with minimal impact on overall performance.
Contribution
TARS is a new, modular technique that precisely targets and reverses internal concept representations in LLMs for effective knowledge removal.
Findings
TARS reduces target concept trigger probability to zero with a single edit.
Knowledge removal is effective across multiple languages.
Overall model performance is minimally affected, even after multiple concept removals.
Abstract
The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well as the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multilingual capacity, and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method first leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head.…
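The refinement step described in the abstract (perturb the approximate concept vector with noise, map it to token scores with the language model head, and favour vectors that trigger the concept token) can be illustrated with a toy sketch. This is not the authors' implementation: the function name `refine_concept_vector`, the greedy accept-if-better rule, the Gaussian noise, the parameters `noise_scale` and `steps`, and the random stand-in for the language model head are all assumptions made for illustration.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of token scores."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def refine_concept_vector(v0, lm_head, target_id, noise_scale=0.1, steps=200, seed=0):
    """Hypothetical refinement loop (an assumption, not the paper's exact rule).

    Repeatedly adds Gaussian noise to the current best vector, projects the
    candidate to token scores with `lm_head`, and keeps the candidate only if
    it raises the probability of the target concept token.
    """
    rng = np.random.default_rng(seed)
    best = v0.copy()
    best_p = softmax(lm_head @ best)[target_id]
    for _ in range(steps):
        candidate = best + noise_scale * rng.standard_normal(best.shape)
        p = softmax(lm_head @ candidate)[target_id]
        if p > best_p:
            best, best_p = candidate, p
    return best, best_p

# Toy stand-ins: a random "language model head" (vocab 50, hidden size 16)
# and a random starting vector. Real values would come from the LLM.
toy_rng = np.random.default_rng(1)
lm_head = toy_rng.standard_normal((50, 16))
v0 = toy_rng.standard_normal(16)

start_p = softmax(lm_head @ v0)[7]
refined, refined_p = refine_concept_vector(v0, lm_head, target_id=7)
print(f"target-token probability: {start_p:.4f} -> {refined_p:.4f}")
```

Because a candidate is accepted only when it improves the target-token probability, the returned probability can never fall below the starting one. The abstract is truncated before it describes how the refined vector is then used for the reversal itself, so that stage is not sketched here.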
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: LLaMA
