MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning
Yupeng Chen, Senmiao Wang, Yushun Zhang, Zhihang Lin, Haozhe Zhang, Weijian Sun, Tian Ding, Ruoyu Sun

TL;DR
MoFO is a novel optimizer for fine-tuning large language models that selectively updates parameters with high momentum to reduce knowledge forgetting without requiring access to pre-training data.
Contribution
MoFO introduces a momentum-based selective update algorithm that mitigates forgetting during fine-tuning of LLMs without needing pre-training data.
Findings
MoFO achieves fine-tuning performance comparable to that of standard fine-tuning methods.
MoFO effectively reduces knowledge forgetting in LLMs.
Theoretical convergence guarantees support MoFO's effectiveness.
Abstract
Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, LLMs are first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some knowledge acquired in the pre-training stage, leading to a decline in general capabilities. Existing approaches to mitigate forgetting often rely on access to pre-training data, which may be unavailable in many real-world scenarios--such as fine-tuning checkpoint-only open-source LLMs. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). MoFO is an extension of greedy block coordinate descent (BCD) methods: in each iteration, MoFO only updates the model parameters with the largest momentum magnitudes, while keeping all other parameters fixed. MoFO achieves similar fine-tuning…
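The selective update described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes Adam-style first and second moments with bias correction, operates on a single parameter block stored as a plain Python list, and uses a hypothetical `keep_fraction` argument to decide how many entries count as having the "largest momentum magnitudes". The function name `mofo_step` and all hyperparameter defaults are likewise assumptions made for the example.

```python
import math


def mofo_step(params, grads, m, v, step, lr=0.1, beta1=0.9, beta2=0.999,
              eps=1e-8, keep_fraction=0.5):
    """One momentum-filtered update on a single parameter block.

    params, grads, m, v are equal-length lists of floats; params, m and v
    are modified in place. Returns the sorted indices that were updated.
    Assumption: Adam-style moments; keep_fraction is a hypothetical knob.
    """
    n = len(params)

    # Moment estimates are refreshed for every entry, selected or not.
    for i in range(n):
        m[i] = beta1 * m[i] + (1 - beta1) * grads[i]
        v[i] = beta2 * v[i] + (1 - beta2) * grads[i] ** 2

    # Filter: keep only the entries with the largest momentum magnitudes.
    k = max(1, math.ceil(keep_fraction * n))
    selected = sorted(range(n), key=lambda i: abs(m[i]), reverse=True)[:k]

    # Bias-corrected Adam-style step, applied to the selected entries only.
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    for i in selected:
        m_hat = m[i] / bias1
        v_hat = v[i] / bias2
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

    return sorted(selected)


if __name__ == "__main__":
    block = [1.0, 1.0, 1.0, 1.0]
    m_state = [0.0] * 4
    v_state = [0.0] * 4
    updated = mofo_step(block, [0.1, -2.0, 0.5, 3.0], m_state, v_state, step=1)
    print("updated indices:", updated)
    print("parameters:", block)
```

In this sketch the entries that are not selected keep their current values for that iteration, while their moment estimates still accumulate, so a different subset may be selected at a later step. In a full model the same filter would be applied separately within each parameter block.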
Peer Reviews
Decision: Submitted to ICLR 2025
1. This paper provides a comprehensive and reasonable proof of convergence. 2. The effectiveness of the method is verified with LLMs of different sizes. 3. The convergence directions of MoFO and the baselines are visualized on the loss landscape, showing that MoFO converges to a point closer to the pre-trained model.
In Section 4, the loss landscapes for HFT and LoRA were not reported.
1. The paper deals with an important problem in LLM applications and gives a practical solution with fair theoretical support. 2. The illustration and analysis of the MoFO strategy are clear and detailed. 3. The authors support their method with experiments covering various tasks and settings.
1. The paper's discussion does not cover other algorithms that use neither replay nor regularization, and therefore does not give enough support for the motivation and novelty of the work (including but not limited to: https://arxiv.org/pdf/2309.06256, https://arxiv.org/pdf/2404.10306, https://arxiv.org/abs/2302.03241, etc.). A brief literature review section could be added that specifically compares MoFO to these other approaches, highlighting key differences and potential advantages. 2. The
1. The paper provides a simple but elegant approach to mitigating forgetting in LLM fine-tuning. 2. Empirical results are convincing. 3. The paper is well-written with illustrative figures.
1. Using the default partitioning of model parameters as implemented in PyTorch is quite an ad-hoc choice. Is there any better or more principled way to partition the model parameters? 2. The convergence result is good, but it does not show the advantages of MoFO over the traditional Adam method, especially in mitigating forgetting.
Taxonomy
Topics: Nuclear reactor physics and engineering · Magnetic confinement fusion research · Advancements in Photolithography Techniques
