FTFT: Efficient and Robust Fine-Tuning by Transferring Training Dynamics
Yupei Du, Albert Gatt, Dong Nguyen

TL;DR
FTFT introduces a transfer-based fine-tuning method that enhances robustness and reduces training costs for large language models by leveraging transferable training dynamics and early stopping.
Contribution
It proposes a novel fine-tuning approach that transfers training dynamics from smaller models, improving efficiency and robustness over traditional methods like dataset cartography.
Findings
FTFT reduces training cost by up to 50%.
Training dynamics are highly transferable across models.
FTFT achieves better robustness than ERM.
Abstract
Despite the massive success of fine-tuning Pre-trained Language Models (PLMs), they remain susceptible to out-of-distribution input. Dataset cartography is a simple yet effective dual-model approach that improves the robustness of fine-tuned PLMs. It involves fine-tuning a model on the original training set (i.e. reference model), selecting a subset of important training instances based on the training dynamics, and fine-tuning again only on these selected examples (i.e. main model). However, this approach requires fine-tuning the same model twice, which is computationally expensive for large PLMs. In this paper, we show that (1) training dynamics are highly transferable across model sizes and pre-training methods, and that (2) fine-tuning main models using these selected training instances achieves higher training efficiency than empirical risk minimization (ERM). Building on these…
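The selection step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: it assumes the data-map statistics of Swayamdipta et al. (2020), where an example's confidence is the mean probability the reference model assigns to the gold label across epochs and its variability is the standard deviation of that probability, and it keeps the highest-variability ("ambiguous") examples. The function names, the array layout, and the default `fraction` of 0.33 are assumptions made for this example.

```python
import numpy as np


def data_map_statistics(true_label_probs):
    """Compute per-example confidence and variability.

    true_label_probs: array-like of shape (n_epochs, n_examples) holding the
    reference model's probability for the gold label after each epoch
    (hypothetical layout, chosen for this sketch).
    """
    probs = np.asarray(true_label_probs, dtype=float)
    confidence = probs.mean(axis=0)   # mean gold-label probability over epochs
    variability = probs.std(axis=0)   # spread of that probability over epochs
    return confidence, variability


def select_ambiguous(true_label_probs, fraction=0.33):
    """Return sorted indices of the most ambiguous (highest-variability) examples.

    fraction is an assumed hyperparameter: the share of the training set to keep.
    """
    _, variability = data_map_statistics(true_label_probs)
    k = max(1, int(round(fraction * variability.shape[0])))
    order = np.argsort(-variability, kind="stable")  # descending variability
    return np.sort(order[:k])


if __name__ == "__main__":
    # Toy dynamics: 3 epochs, 4 training examples.
    toy_probs = [
        [0.90, 0.20, 0.50, 0.10],
        [0.95, 0.80, 0.50, 0.10],
        [0.99, 0.30, 0.50, 0.15],
    ]
    selected = select_ambiguous(toy_probs, fraction=0.5)
    print("indices kept for fine-tuning the main model:", selected.tolist())
```

In the transfer setting the abstract describes, the probabilities would be recorded while fine-tuning a smaller reference model, and the returned indices would define the subset used to fine-tune the larger main model.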
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. Simple and clear idea. 2. The motivation and reasoning are well explained.
1. The proposed method has limited novelty compared to data maps. 2. The results are mixed. Of the experiments in Table 2, only half show successful transfer, suggesting that the scale of the reference model still needs to be relatively close to that of the main model. Moreover, even in the remaining rows, the results are inconsistent across datasets. 3. The cost saving applies only to fine-tuning time, not inference time. However, for LLMs, fine-tuning cost is much less of an issue than pretraining or in
The strength of the paper lies in its easy-to-understand explanation of the algorithm. The authors begin with a clear description of the existing literature on the data map methods and the underlying issue of these methods. With a proposed hypothesis of using a small model to provide the necessary data map, the authors test multiple candidates that can act as the reference small model. Finally, extensive experimentation shows the efficacy of their method on multiple ID-OOD dataset pairs.
I have a few questions regarding the experimental setup. (a) How efficient is FTFT compared to ERM in terms of total FLOPs? Since FTFT first trains a small reference model to select the ambiguous set of examples, it also incurs the FLOP cost of training that reference model. A rough estimate of the FLOP counts for both methods would be useful. (b) How does FTFT perform when compared to existing algorithms that aim to improve the OOD performance of trained models? Examples of such
1. The authors conduct a systematic investigation of various reference model sizes and compare against reference models trained with an alternative discriminative pretraining method. 2. The authors demonstrate that using smaller reference models and training on the resulting DataMap does not result in performance reduction as compared with an ERM baseline trained over the entire dataset; and results in improved performance on OOD robustness datasets.
1. One of the primary contributions is the sample efficiency of models trained on a smaller data map selected via ambiguity. However, there is limited comparison with or discussion of related work on sample-efficient training methods such as curriculum learning and dataset pruning [1, 2]. 2. The DataMap selection criterion is limited to example ambiguity and is not compared against other criteria such as "hard-to-learn" examples or example forgettability [3]. 3. Evaluations are limited to finetuning of m
Although the novelty is limited, ablations and benchmarking shown are quite detailed. Presentation is very clear. Related literature is very well-reviewed.
The contribution is extremely limited over Swayamdipta et al. (2020), who proposed the original data map method, combined with the work of Sar-Shalom & Schwartz (2023), who demonstrated that a DM constructed by ELECTRALarge can be used to improve the robustness of DeBERTaV3Large. This work reads more like a tech report than an ICLR paper. The insights are practically useful but do not go beyond systematic benchmarking.
