Fine-Tuning Attention Modules Only: Enhancing Weight Disentanglement in Task Arithmetic

Ruochen Jin; Bojian Hou; Jiancong Xiao; Weijie Su; Li Shen

arXiv:2407.07089 · cs.LG · January 30, 2025


TL;DR

This paper improves weight disentanglement in task arithmetic by fine-tuning only the attention modules of a transformer, which reduces interference among tasks while avoiding the extra training cost of NTK linearization.

Contribution

The study demonstrates that fine-tuning only attention modules significantly enhances weight disentanglement in task arithmetic, offering an efficient alternative to NTK linearization.
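As a rough illustration of what restricting fine-tuning to attention modules could look like, the sketch below marks parameters as trainable by matching on their names. The parameter names, the `select_trainable` helper, and the `attn` / `attention` markers are hypothetical and not taken from the paper. Real models name their attention blocks differently, and the sketch does not settle which attention parameters (for example weights only, or weights and biases) are actually updated.

```python
def select_trainable(param_names, attention_markers=("attn", "attention")):
    """Map each parameter name to True if it would be fine-tuned.

    Assumption: attention parameters can be recognised by a substring of
    their name. This is a naming convention, not something the paper states.
    """
    return {
        name: any(marker in name for marker in attention_markers)
        for name in param_names
    }


# Hypothetical parameter names for one transformer block.
names = [
    "blocks.0.attn.in_proj_weight",
    "blocks.0.attn.out_proj.weight",
    "blocks.0.mlp.fc1.weight",
    "blocks.0.ln_1.weight",
]

trainable = select_trainable(names)
frozen = [name for name, keep in trainable.items() if not keep]
```

In a framework such as PyTorch, a mask like this would typically be applied by setting `requires_grad` on each parameter before building the optimizer.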

Findings

01. Attention modules exhibit kernel behavior.
02. Fine-tuning attention modules improves weight disentanglement.
03. Representation modules are crucial for disentanglement.
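
For readers unfamiliar with the terms in these findings, the block below restates them in the form commonly used in the task arithmetic literature. The notation ($f$, $\theta_0$, $\tau_t$, $\alpha_t$, $\mathcal{D}_t$, $g_t$, $g_0$) is assumed here and does not appear on this page. Kernel behavior means the fine-tuned network is well approximated by its first-order expansion around the pre-trained weights $\theta_0$. Weight disentanglement means each task vector $\tau_t$ changes the output only on inputs from its own task domain $\mathcal{D}_t$.

```latex
% Kernel behavior: first-order (NTK) approximation around the pre-trained weights
f(x;\theta) \approx f(x;\theta_0) + (\theta - \theta_0)^\top \nabla_\theta f(x;\theta_0)

% Weight disentanglement: the merged model decomposes into per-task terms
\begin{aligned}
f\Big(x;\ \theta_0 + \sum_{t=1}^{T} \alpha_t \tau_t\Big)
  &= \sum_{t=1}^{T} g_t(x;\ \alpha_t \tau_t) + g_0(x), \\
g_t(x;\ \alpha_t \tau_t) &= 0 \quad \text{for } x \notin \mathcal{D}_t, \\
g_0(x) &= 0 \quad \text{for } x \in \bigcup_{t=1}^{T} \mathcal{D}_t .
\end{aligned}
```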

Abstract

In recent years, task arithmetic has garnered increasing attention. This approach edits pre-trained models directly in weight space by combining the fine-tuned weights of various tasks into a unified model. Its efficiency and cost-effectiveness stem from its training-free combination, contrasting with traditional methods that require model training on large datasets for multiple tasks. However, applying such a unified model to individual tasks can lead to interference from other tasks (lack of weight disentanglement). To address this issue, Neural Tangent Kernel (NTK) linearization has been employed to leverage a "kernel behavior", facilitating weight disentanglement and mitigating adverse effects from unrelated tasks. Despite its benefits, NTK linearization presents drawbacks, including doubled training costs, as well as reduced performance of individual models. To tackle this problem,…
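
The training-free combination described in the abstract can be sketched as follows, assuming the usual formulation in which each task contributes a task vector (fine-tuned weights minus pre-trained weights) and the vectors are added back to the pre-trained weights. The function names, the scaling coefficient `alpha`, and the toy weights are illustrative assumptions, not details given on this page.

```python
def task_vector(pretrained, finetuned):
    """Per-parameter difference between fine-tuned and pre-trained weights."""
    return {
        key: [f - p for p, f in zip(pretrained[key], finetuned[key])]
        for key in pretrained
    }


def merge(pretrained, task_vectors, alpha=1.0):
    """Add the scaled task vectors to the pre-trained weights (no training)."""
    merged = {key: list(values) for key, values in pretrained.items()}
    for vector in task_vectors:
        for key in merged:
            merged[key] = [m + alpha * d for m, d in zip(merged[key], vector[key])]
    return merged


# Toy weights: one parameter tensor flattened to a list of floats.
pretrained = {"w": [1.0, 2.0]}
finetuned_a = {"w": [1.5, 2.0]}  # hypothetical model fine-tuned on task A
finetuned_b = {"w": [1.0, 1.0]}  # hypothetical model fine-tuned on task B

tau_a = task_vector(pretrained, finetuned_a)
tau_b = task_vector(pretrained, finetuned_b)
unified = merge(pretrained, [tau_a, tau_b], alpha=1.0)
```

In this toy case the two task vectors touch different coordinates, so the unified model reproduces each fine-tuned change without interference. When task vectors overlap, one task's update can disturb another's, which is the lack of weight disentanglement the abstract refers to.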



Taxonomy

Topics: Parallel Computing and Optimization Techniques · Neural Networks and Applications · Matrix Theory and Algorithms

Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer