HFT: Half Fine-Tuning for Large Language Models

Tingfeng Hui; Zhenyu Zhang; Shuohuan Wang; Weiran Xu; Yu Sun; Hua Wu

arXiv:2404.18466·cs.CL·April 30, 2024·1 cites

TL;DR

This paper introduces Half Fine-Tuning (HFT), a method that fine-tunes only half of a large language model's parameters to reduce catastrophic forgetting, improve efficiency, and maintain performance across various tasks.

Contribution

HFT is a novel fine-tuning approach that selectively updates half of the model's parameters, effectively mitigating forgetting without altering the model architecture.
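The selection idea can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the parameter names, the `select_half` and `hft_step` helpers, the uniform random choice of tensors, and the plain SGD update are all hypothetical, and the paper's actual selection strategy may differ. It picks half of the parameter tensors to train and leaves the rest untouched.

```python
import random


def select_half(param_names, seed=0):
    """Pick half of the parameter tensors to train (hypothetical random choice)."""
    rng = random.Random(seed)
    names = sorted(param_names)
    return set(rng.sample(names, len(names) // 2))


def hft_step(params, grads, trainable, lr=0.1):
    """One SGD step on the trainable tensors; frozen tensors are copied unchanged."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)
    return updated


# Toy "model": four parameter tensors stored as flat lists (names are made up).
params = {
    "attn.q": [1.0, 2.0],
    "attn.k": [3.0, 4.0],
    "ffn.up": [5.0, 6.0],
    "ffn.down": [7.0, 8.0],
}
grads = {name: [1.0, 1.0] for name in params}

trainable = select_half(params.keys(), seed=0)
frozen = set(params) - trainable
new_params = hft_step(params, grads, trainable)

print("trainable:", sorted(trainable))
print("frozen:   ", sorted(frozen))
```

Note that this sketch halves the number of tensors, not the number of scalar parameters; the two coincide only when the tensors have equal size. The model architecture itself is unchanged, which matches the contribution statement above.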

Findings

01. HFT significantly reduces forgetting compared to full fine-tuning.

02. HFT achieves comparable or better performance on downstream tasks.

03. HFT reduces training time by approximately 30%.

Abstract

Fine-tuning large language models (LLMs) in one or more phases has become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting: during sequential training, the parametric knowledge or abilities learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues, where half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter…
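The observation about resetting partial parameters can be sketched as follows. This is a minimal illustration under assumptions, not the procedure from the paper: the `half_reset` helper, the tensor names, and the random choice of which half to restore are hypothetical, and the abstract does not specify how or how often parameters are reset.

```python
import random


def half_reset(pretrained, finetuned, seed=0):
    """Restore a random half of the fine-tuned tensors to their pre-trained values."""
    rng = random.Random(seed)
    names = sorted(finetuned)
    reset = set(rng.sample(names, len(names) // 2))
    merged = {
        name: list(pretrained[name] if name in reset else finetuned[name])
        for name in names
    }
    return merged, reset


# Toy checkpoints: the fine-tuned copy has drifted by 0.5 in every tensor.
pretrained = {"attn.q": [1.0], "attn.k": [2.0], "ffn.up": [3.0], "ffn.down": [4.0]}
finetuned = {name: [v + 0.5 for v in vals] for name, vals in pretrained.items()}

merged, reset = half_reset(pretrained, finetuned)
print("reset to pre-trained:", sorted(reset))
```

The merged checkpoint keeps the fine-tuned values for one half of the tensors and the pre-trained values for the other half, which is the state that HFT reaches directly by freezing that half during training.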

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 3·Confidence 4

Strengths

1. This work is well-organized and easy to follow.
2. The experimental results are comprehensive and demonstrate the effectiveness of updating a subset of parameters.

Weaknesses

Disadvantages: Viewed as a new fine-tuning method, the design is trivial. Numerous existing studies [1-4] have explored partial optimization (parameter isolation) to improve model performance. The heuristic choice to freeze 50% of the parameters does not provide a robust framework for further advancements in the parameter selection domain. The so-called "half fine-tuning" method appears to be a specific instance of freezing selection or parameter isolation. Selection crite

Reviewer 02·Rating 5·Confidence 5

Strengths

The method is simple, easy to implement, and can be generalized to any LLM. The experiments are sufficient and basically cover SFT and RLHF in addition to pre-training. Compared with full-parameter fine-tuning, the method proposed in this paper achieves better or comparable results.

Weaknesses

The method of freezing a portion of parameters during fine-tuning is relatively common, and numerous studies have already validated the effectiveness of this freezing method. As a result, the proposed method appears a little incremental and lacks sufficient innovation. Furthermore, the effectiveness of the proposed method in the paper largely stems from the fact that the dataset size of each sub-task is relatively small. Fine-tuning a portion of parameters makes the model less prone to overfitti

Reviewer 03·Rating 6·Confidence 3

Strengths

1. The paper is well-written and easy to follow.
2. The proposed method is simple and effective.
3. The experiments cover different settings.

Weaknesses

1. Additional LLMs should be investigated to demonstrate the generalization ability of the proposed method.
2. In the continual learning setting, how does the proposed method compare to the baseline O-LoRA (Orthogonal Subspace Learning for Language Model Continual Learning)?

Reviewer 04·Rating 3·Confidence 4

Strengths

The method is simple. The paper is easy to follow.

Weaknesses

1. **Lack of Distinct Advantage in PEFT Methodology** - The paper proposes HFT as a new PEFT method but does not convincingly demonstrate its superiority over established alternatives like LoRA, Adapters, or BitFit. While the authors claim that "HFT allows LLMs to acquire new abilities while retaining and utilizing previously learned knowledge," this characteristic is inherent to all PEFT methods, not unique to HFT.
2. **Positioning of the "Half-Reset" Technique** - The paper introduce

Reviewer 05·Rating 3·Confidence 4

Strengths

(1) The proposed method is simple and effective.
(2) Forgetting is an important problem, and the authors offer a solution that tackles it.

Weaknesses

(1) The proposed method is too simple and too heuristic, without novelty. Although the authors attempt to provide some theoretical explanations on page four, they are not convincing.
(2) Why does HFT outperform FFT? Because HFT has fewer trainable parameters than FFT, I think FFT could be deemed an upper bound. In my understanding, the experiments reflect a trade-off between downstream tasks and pre-training tasks. However, for many SFT-ed checkpoints, people would care more about

Videos

HFT: Half Fine-Tuning for Large Language Models· underline

Taxonomy

Topics·Topic Modeling

Methods·ALIGN