Supervised Fine-Tuning Achieve Rapid Task Adaption Via Alternating Attention Head Activation Patterns

Yang Zhao; Li Du; Xiao Ding; Kai Xiong; Ting Liu; Bing Qin

arXiv:2409.15820·cs.LG·October 21, 2024

TL;DR

This paper investigates how supervised fine-tuning (SFT) enables large language models to adapt rapidly to complex tasks by analyzing attention head activation patterns, revealing key mechanisms and proposing improvements.

Contribution

It introduces a gradient-based method to dissect SFT, uncovering how attention heads activate and combine for task adaptation, and demonstrates ways to improve SFT efficiency.
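As a rough illustration of what a gradient-based head analysis can look like, the sketch below scores each attention head by the magnitude of the loss gradient with respect to a per-head mask. This page does not describe the paper's exact procedure, so the toy model (output as a masked sum of per-head contributions, squared-error loss) and the names `head_importance` and `activation_pattern` are assumptions made for illustration only.

```python
# Toy mask-gradient head scoring. This is an assumed setup, not the paper's method.
# Model:  y = sum_h m_h * o_h, with every mask m_h = 1
# Loss:   L = 0.5 * ||y - target||^2
# Hence:  dL/dm_h = (y - target) . o_h


def head_importance(head_outputs, target):
    """Return |dL/dm_h| for each head, given its output contribution o_h."""
    dim = len(target)
    # Forward pass with all masks set to 1.
    y = [sum(o[i] for o in head_outputs) for i in range(dim)]
    residual = [y[i] - target[i] for i in range(dim)]
    # Analytic gradient of the loss w.r.t. each head's mask.
    return [abs(sum(r * v for r, v in zip(residual, o))) for o in head_outputs]


def activation_pattern(scores):
    """Normalize scores so they sum to 1 (all zeros if every score is 0)."""
    total = sum(scores)
    if total == 0:
        return [0.0] * len(scores)
    return [s / total for s in scores]


if __name__ == "__main__":
    # Three hypothetical heads writing into a 2-dimensional output.
    outputs = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    target = [3.0, 1.0]
    scores = head_importance(outputs, target)
    print(scores)
    print(activation_pattern(scores))
```

In this toy setup, a head whose contribution lines up with the remaining error gets a large score, while a head that contributes nothing scores zero. Comparing the normalized scores before and after fine-tuning would be one way to observe a change in the activation pattern.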

Findings

1. LLMs activate task-specific attention heads during SFT

2. Activation patterns for complex tasks are combinations of basic patterns

3. Small parameter changes significantly impact activation patterns

Abstract

LLMs' performance on complex tasks remains unsatisfactory. A key issue is that LLMs currently learn in a data-driven manner, while instruction data for these complex tasks is both scarce and hard to collect or construct. In contrast, LLMs can learn simpler tasks quickly when adequate prior knowledge has been captured during the pretraining stage. If the prerequisites and mechanism of such rapid generalization could be elucidated, they could be used to make LLMs learn complex tasks more efficiently and effectively. In this paper, we therefore employ a gradient-based method to dissect how SFT adapts LLMs to downstream tasks, from the perspective of attention patterns. We find that: (1) LLMs selectively activate task-specific attention heads during SFT; (2) activation patterns for complex tasks are combinations of basic…

Taxonomy

Topics: Interactive and Immersive Displays

Methods: Softmax · Attention Is All You Need · Shrink and Fine-Tune