Theoretical Insights into Fine-Tuning Attention Mechanism: Generalization and Optimization
Xinhao Yao, Hongjin Qian, Xiaolin Hu, Gengze Xu, Wei Liu, Jian Luan, Bin Wang, Yong Liu

TL;DR
This paper provides theoretical insights into fine-tuning attention mechanisms in large language models, highlighting the importance of selectively tuning certain matrices and assigning different learning rates to improve efficiency and performance.
Contribution
It introduces a novel understanding of the importance of tuning specific attention matrices and learning rate strategies, leading to more efficient fine-tuning methods for LLMs.
Findings
Optimizing the value matrix W_v yields better performance than optimizing the key matrix W_k.
Fine-tuning only W_q and W_v can match or surpass full fine-tuning results.
Higher learning rates for W_v accelerate convergence and enhance performance.
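The findings above can be sketched as an optimizer configuration: update only W_q and W_v, and give W_v a larger learning rate. This is a minimal illustration, not the authors' implementation. The helper `build_param_groups`, the name fragments `q_proj` and `v_proj` (a naming convention used by some Transformer implementations), and the ratio of 4.0 are all assumptions chosen for the example.

```python
from typing import Any, Dict, Iterable, List, Tuple


def build_param_groups(
    named_params: Iterable[Tuple[str, Any]],
    base_lr: float = 1e-4,
    v_lr_ratio: float = 4.0,  # hypothetical ratio, not taken from the paper
    q_key: str = "q_proj",    # assumed name fragment for W_q
    v_key: str = "v_proj",    # assumed name fragment for W_v
) -> List[Dict[str, Any]]:
    """Collect W_q and W_v parameters into two groups with different learning rates."""
    q_params: List[Any] = []
    v_params: List[Any] = []
    for name, param in named_params:
        if v_key in name:
            v_params.append(param)
        elif q_key in name:
            q_params.append(param)
        # All other parameters (W_k, MLP, embeddings, ...) are left out,
        # so an optimizer built from these groups never updates them.
    return [
        {"params": q_params, "lr": base_lr},
        {"params": v_params, "lr": base_lr * v_lr_ratio},
    ]


# Toy stand-in for a model's (name, parameter) pairs; strings replace tensors.
demo = [
    ("layers.0.self_attn.q_proj.weight", "Wq0"),
    ("layers.0.self_attn.k_proj.weight", "Wk0"),
    ("layers.0.self_attn.v_proj.weight", "Wv0"),
    ("layers.0.mlp.up_proj.weight", "Wup0"),
]
groups = build_param_groups(demo, base_lr=1e-4, v_lr_ratio=4.0)
```

The returned list has the per-group dictionary shape that PyTorch optimizers accept, so with a real model it could be passed as `torch.optim.AdamW(build_param_groups(model.named_parameters()))`. In that setting one would typically also set `requires_grad = False` on the excluded parameters to avoid computing unused gradients.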
Abstract
Large Language Models (LLMs), built on Transformer architectures, exhibit remarkable generalization across a wide range of tasks. However, fine-tuning these models for specific tasks remains resource-intensive due to their extensive parameterization. In this paper, we explore two remarkable phenomena related to the attention mechanism during the fine-tuning of LLMs (where W_q, W_k, and W_v denote the weights of the query, key, and value layers, respectively). The first phenomenon, termed "Unequal Importance of Attention Matrices", highlights the impact of fine-tuning different weight matrices. It shows that optimizing the W_v matrix yields significantly better performance than optimizing the W_k matrix. Fine-tuning only the W_q and W_v matrices is computationally efficient while delivering results comparable…
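For reference, the three matrices named in the abstract enter standard single-head scaled dot-product attention as shown below. This is the textbook form and may not match the paper's exact notation; X (the input token representations) and d_k (the key dimension) are symbols assumed here.

```latex
\mathrm{Attn}(X) = \mathrm{softmax}\!\left( \frac{(X W_q)(X W_k)^{\top}}{\sqrt{d_k}} \right) X W_v
```

In this form W_q and W_k influence the output only through the softmax argument, while W_v acts linearly on the attended output.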
Peer Reviews
Decision·Submitted to ICLR 2025
- **Importance of Understanding Attention Mechanisms During Fine-Tuning**: The challenge of gaining a deeper understanding of the attention mechanism during fine-tuning is a critical one. The approach developed in this paper has the potential to serve as a plug-and-play solution for achieving improved accuracy-efficiency trade-offs in LLM fine-tuning.
- **Empirical and Theoretical Contributions**: This paper offers both empirical and theoretical analyses to elucidate the behavior of the attenti
- **Generalizability of the Proposed Approach**: The primary concern is the generalizability of the proposed approach. Specifically, the authors could enhance the analysis by demonstrating that the proposed method consistently improves LLM performance across diverse scenarios. To this end, it would be beneficial to include performance results under more complex, open-ended generation tasks, such as MT-Bench or comparable challenging benchmarks. Additionally, considering variations in model behav
- Originality: The originality of this paper lies in its focused approach to fine-tuning the attention mechanism of large language models. By exploring the selective fine-tuning of `Wv` and `Wq` matrices, the authors introduce a novel method that challenges the conventional approach of fine-tuning all attention matrices (`Wq`, `Wk`, `Wv`). The theoretical insights into the distinct roles of these matrices, combined with empirical validation, provide a fresh perspective on optimizing the attentio
1. **Lack of Base Model Performance for Each Task**: The paper does not provide the performance of the base model before fine-tuning for each task, making it challenging to evaluate the true effectiveness of the fine-tuning methods. Including these baseline results would help contextualize the improvements made through fine-tuning.
2. **GLUE Evaluation Is Too Simple for LLaMA3.1-8B**: The use of the GLUE benchmark to evaluate the LLaMA3.1-8B model is insufficient, as GLUE tasks are relatively s
The idea of using different local learning rates for Wq and Wv is interesting. Because V is multiplied by the attention logits, q and v should differ more in their gradient distributions than q and k do. Using different local learning rates for these two types of matrices is reasonable and worth exploring.
1. The writing and presentation are rather poor. Polishing at both the sentence level and the logical level is suggested for this work. For example, in lines 16-22: "In this paper, we investigate two remarkable phenomena ... with a higher learning rate for the Wv matrix expediting convergence." The ordering of the sentences and the structure of the arguments make the text unnecessarily difficult to read. This kind of problem occurs throughout the manuscript. A suggestion is, use multiple statement sente
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Neural dynamics and brain function · Neural Networks and Applications
Methods: Attention Is All You Need · Dense Connections · Adam · Linear Layer · Residual Connection · Position-Wise Feed-Forward Layer · Label Smoothing · Dropout · Byte Pair Encoding · Absolute Position Encodings
