Targeted Angular Reversal of Weights (TARS) for Knowledge Removal in Large Language Models

Harry J. Davies; Giorgos Iacovides; Danilo P. Mandic

arXiv:2412.10257 · cs.CL · December 17, 2024

TL;DR

The paper introduces TARS, a novel method for removing specific knowledge from large language models by reversing internal concept vectors, achieving effective removal across languages with minimal impact on overall performance.

Contribution

TARS is a new, modular technique that precisely targets and reverses internal concept representations in LLMs for effective knowledge removal.

Findings

01 TARS reduces target concept trigger probability to zero with a single edit.

02 Knowledge removal is effective across multiple languages.

03 Minimal impact on overall model performance after multiple concept removals.

Abstract

The sheer scale of data required to train modern large language models (LLMs) poses significant risks, as models are likely to gain knowledge of sensitive topics such as bio-security, as well as the ability to replicate copyrighted works. Methods designed to remove such knowledge must do so from all prompt directions, in a multi-lingual capacity and without degrading general model performance. To this end, we introduce the targeted angular reversal (TARS) method of knowledge removal from LLMs. The TARS method first leverages the LLM in combination with a detailed prompt to aggregate information about a selected concept in the internal representation space of the LLM. It then refines this approximate concept vector to trigger the concept token with high probability, by perturbing the approximate concept vector with noise and transforming it into token scores with the language model head.…
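
The refinement step described in the abstract (perturb the approximate concept vector with noise, map it to token scores with the language model head, and prefer vectors that trigger the concept token more strongly) can be illustrated with a toy sketch. This is not the authors' implementation: the greedy accept-if-better rule, the Gaussian noise, the names `token_probs` and `refine_concept_vector`, and the parameters `noise_scale`, `steps`, and `seed` are all assumptions made for illustration, and the "language model head" here is a plain list of weight rows rather than a real LLM layer.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def token_probs(vector, lm_head):
    """Map a hidden vector to token probabilities.

    lm_head is a toy stand-in for a language model head:
    one weight row per vocabulary token.
    """
    scores = [sum(w * x for w, x in zip(row, vector)) for row in lm_head]
    return softmax(scores)


def refine_concept_vector(approx, lm_head, target_token,
                          noise_scale=0.1, steps=200, seed=0):
    """Greedy noise-based refinement (an assumed search rule).

    Each step adds Gaussian noise to the current best vector and keeps
    the perturbed vector only if it raises the probability of the
    target concept token.
    """
    rng = random.Random(seed)
    best = list(approx)
    best_p = token_probs(best, lm_head)[target_token]
    for _ in range(steps):
        candidate = [x + rng.gauss(0.0, noise_scale) for x in best]
        p = token_probs(candidate, lm_head)[target_token]
        if p > best_p:
            best, best_p = candidate, p
    return best, best_p


if __name__ == "__main__":
    # Hypothetical 3-token vocabulary with 4-dimensional hidden states.
    toy_head = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    start = [0.2, 0.1, 0.3, 0.0]
    before = token_probs(start, toy_head)[2]
    refined, after = refine_concept_vector(start, toy_head, target_token=2)
    print(f"target probability before: {before:.3f}, after: {after:.3f}")
```

Because a candidate is kept only when it improves the target probability, the refined vector is never worse than the starting one under this rule. How the paper selects among perturbed vectors, and how the refined vector is then used to edit weights, is not stated in the truncated abstract above.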

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: LLaMA