Mechanistic Behavior Editing of Language Models

Joykirat Singh; Subhabrata Dutta; Tanmoy Chakraborty

arXiv:2410.04277 · cs.CL · October 8, 2024

Open Access · 1 Repo · 3 Reviews

TL;DR

This paper introduces TaRot, a novel method for task adaptation in large language models that uses learnable rotation matrices optimized via Bayesian Optimization, significantly improving performance on classification and generation tasks.

Contribution

The paper proposes TaRot, an approach that intervenes in the neural circuitries of LLMs using learnable rotation matrices, optimized with Bayesian Optimization, to improve task adaptation.

Findings

1. TaRot improves zero-shot performance by 23.81%.

2. TaRot enhances few-shot performance by 11.15%.

3. The method is effective across multiple tasks and model sizes.

Abstract

Large Language Models trained on web-scale text acquire language generation abilities that can solve a wide range of tasks, particularly when task knowledge is refined into the generative prior using in-context examples. However, spurious features learned from noisy data hinder their generalizability. Supervised finetuning can introduce task specificity, but it is data-inefficient. Prior studies indicate that (i) noisy neural circuitries coexist with generalizable ones within LLMs, and (ii) finetuning typically enhances (or suppresses) existing abilities without introducing new ones. Building upon these findings, we propose TaRot, a novel method for task adaptation. TaRot intervenes in the neural circuitries using learnable rotation matrices that are optimized using Bayesian Optimization on labelled samples, in numbers on the order of standard few-shot prompting examples. Experiments on multiple…
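The core idea in the abstract (a frozen model, a learnable rotation applied to an internal representation, and a gradient-free optimizer fitted on a handful of labelled samples) can be illustrated with a toy sketch. This is not the authors' implementation: the 2-D vectors, the fixed readout, and the names `rotate`, `predict`, `accuracy`, `search_rotation`, and `SAMPLES` are all hypothetical, and plain random search is swapped in for Bayesian Optimization to keep the example dependency-free.

```python
import math
import random


def rotate(vec, theta):
    """Apply a 2-D rotation by angle theta (radians) to vec."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def predict(vec, theta):
    """Frozen toy readout: class 1 if the rotated first coordinate is positive."""
    return 1 if rotate(vec, theta)[0] > 0 else 0


def accuracy(theta, samples):
    """Fraction of labelled samples classified correctly after rotation."""
    return sum(predict(v, theta) == y for v, y in samples) / len(samples)


def search_rotation(samples, n_trials=200, seed=0):
    """Gradient-free search over theta.

    Assumption: random search stands in for Bayesian Optimization here.
    """
    rng = random.Random(seed)
    best_theta, best_acc = 0.0, accuracy(0.0, samples)
    for _ in range(n_trials):
        theta = rng.uniform(-math.pi, math.pi)
        acc = accuracy(theta, samples)
        if acc > best_acc:  # keep only strict improvements
            best_theta, best_acc = theta, acc
    return best_theta, best_acc


# Hypothetical few-shot data: the label depends on the second coordinate,
# but the frozen readout only looks at the first one.
SAMPLES = [
    ((0.1, 1.0), 1), ((-0.2, 0.8), 1), ((0.3, 1.2), 1),
    ((0.2, -1.0), 0), ((-0.1, -0.9), 0), ((-0.3, -1.1), 0),
]

if __name__ == "__main__":
    theta, acc = search_rotation(SAMPLES)
    print(f"baseline accuracy: {accuracy(0.0, SAMPLES):.2f}")
    print(f"best theta: {theta:.2f} rad, accuracy: {acc:.2f}")
```

In this toy setting the readout weights never change; only the rotation angle is searched, which mirrors the abstract's claim that adaptation can come from re-orienting existing representations rather than from learning new ones. A real setup would presumably rotate high-dimensional activations inside the model and score each candidate with a task loss.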

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 2

Strengths

Steering pretrained LLMs with lightweight and interpretable intervention is of massive importance.

Weaknesses

The central claim of TaRot being a "tuning-free" method achieving effects comparable to finetuning requires further scrutiny. While TaRot avoids modifying the original LLM weights directly, it introduces and optimizes a set of parameters (rotation matrices) for each task. This approach parallels other parameter-efficient adaptation methods like adapter-based techniques, which also modify model behavior without full finetuning, albeit in a gradient-based manner. Thus, the distinction of TaRot

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. TaRot offers a new perspective on task adaptation by directly editing model behavior through mechanistic interventions, which is a significant departure from traditional SFT.
2. The method is highly data-efficient and designed to work with LLMs of varying sizes, demonstrating its scalability and versatility.
3. The paper provides extensive experimental results, showing consistent improvements in performance across multiple tasks and models.

Weaknesses

1. While the method is innovative, the complexity of understanding and implementing rotation matrices in neural circuits might be a barrier for some researchers and practitioners.
2. TaRot's effectiveness is dependent on the quality of pretraining, which means it may not perform well if the base model has significant limitations. This also limits its applicability in scenarios where models need to be fine-tuned for new domains.
3. The paper could benefit from a more detailed comparison with SFT

Reviewer 03 · Rating 3 · Confidence 3

Strengths

1. This work proposes a new mechanistic intervention method for model adaptation without extensive fine-tuning, highlighting its potential significance.
2. The methodology is well-designed, combining mathematical rigor with empirical validation across diverse tasks.
3. This work demonstrates improvement across multiple LLMs on classification and generation tasks.
4. The paper clearly articulates the TaRot concept and its implications, offering comprehensive descriptions of experiments and res

Weaknesses

1. There is a lack of comprehensive comparison among different editing methods proposed in prior works. Relying on a single method based on singular values may not provide sufficient insight.
2. Current LLMs, such as Llama 3 70B, often exceed 8 billion parameters, yet this work does not include experiments on models larger than 8B.
3. The proposed method does not outperform the baseline on the primary metric shown in Table 3.
4. The impact of rotation editing on other previously learned knowled

Code & Models

Repositories

joykirat18/tarot
jax · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Reinforcement Learning in Robotics