Learning Dynamics of LLM Finetuning
Yi Ren, Danica J. Sutherland

TL;DR
This paper studies the learning dynamics of large language models during finetuning. It shows how influence accumulates among different potential responses, and uses this to explain phenomena such as the strengthening of hallucinations and the squeezing effect, offering insights for improved alignment.
Contribution
It introduces a unified framework to analyze LLM finetuning dynamics, explaining hallucination strengthening and the squeezing effect, and proposes a simple method to enhance alignment.
Findings
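Specific types of hallucination are strengthened after finetuning, e.g., the model reuses phrases or facts from the response to question B when answering question A, or keeps repeating similar simple phrases.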
The squeezing effect explains why running off-policy DPO for too long makes even the desired outputs less likely.
Insights lead to a simple method for better alignment.
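The squeezing effect named in the findings can be illustrated on a single softmax distribution. The sketch below is a toy illustration, not the paper's experiment: the logits, the learning rate, and the helper name `push_down` are all assumptions chosen for the example. It applies a negative gradient to a label that is already unlikely and shows that the mass removed from it goes mainly to the label that was already most likely.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def push_down(z, y_neg, lr=1.0, steps=1):
    """Gradient descent on log p[y_neg] with respect to the logits z."""
    z = z.astype(float).copy()
    for _ in range(steps):
        p = softmax(z)
        grad = -p              # d log p[y_neg] / dz = one_hot(y_neg) - p
        grad[y_neg] += 1.0
        z -= lr * grad         # descend, i.e. make y_neg less likely
    return z

# Hypothetical peaky distribution: index 0 dominates, index 4 is very unlikely.
logits = np.array([4.0, 2.0, 1.0, 0.0, -3.0])
y_neg = 4                      # the "rejected" label that receives the negative gradient

before = softmax(logits)
after = softmax(push_down(logits, y_neg, lr=1.0, steps=20))

print("before:", np.round(before, 4))
print("after: ", np.round(after, 4))
```

In this toy setting, each step raises every other logit in proportion to its current probability, so the gap between the top label and all the others widens at every step. This is only a single-distribution picture under the stated assumptions; the paper's analysis covers full LLM finetuning.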
Abstract
Learning dynamics, which describes how the learning of specific training examples influences the model's predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts in the response for question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We…
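The notion of learning dynamics described in the abstract can be made concrete with a very small model. The sketch below assumes a linear softmax classifier with logits `z = W @ x`, which is far simpler than an LLM; the inputs, weights, learning rate, and function names (`sgd_step`, `influence`) are hypothetical. It takes one gradient step on a training example `(x_u, y_u)` and measures how the log-probabilities change on a different input `x_o`.

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)
    return z - np.log(np.exp(z).sum())

def sgd_step(W, x_u, y_u, lr):
    """One gradient step on -log p(y_u | x_u) for logits z = W @ x."""
    p = np.exp(log_softmax(W @ x_u))
    residual = -p
    residual[y_u] += 1.0                      # one_hot(y_u) - p
    return W + lr * np.outer(residual, x_u)

def influence(W, x_u, y_u, x_o, lr=0.5):
    """Change in log p(. | x_o) caused by one update on (x_u, y_u)."""
    W_new = sgd_step(W, x_u, y_u, lr)
    return log_softmax(W_new @ x_o) - log_softmax(W @ x_o)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 3))        # 4 classes, 3 input features

x_u, y_u = np.array([1.0, 0.0, 0.0]), 2       # the example being learned
x_similar = np.array([0.9, 0.1, 0.0])         # overlaps with x_u
x_orthogonal = np.array([0.0, 1.0, 0.0])      # shares no features with x_u

delta_similar = influence(W, x_u, y_u, x_similar)
delta_orthogonal = influence(W, x_u, y_u, x_orthogonal)

print("similar input:   ", np.round(delta_similar, 4))
print("orthogonal input:", np.round(delta_orthogonal, 4))
```

For this linear model the logit change at `x_o` is the residual `one_hot(y_u) - p` scaled by the inner product of `x_u` and `x_o`, so learning `y_u` on `x_u` also raises `y_u` on similar inputs and leaves orthogonal inputs untouched. That factorization is a property of the toy model used here, not a statement of the paper's general result.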
Peer Reviews
Decision: ICLR 2025 Oral
1. They motivate and ground the study of training dynamics for LLMs well and highlight key difficulties in applying standard influence function style analyses to LLM finetuning and preference optimization, a highlight of the preliminary material. 2. The formalism and analysis in Section 3 are generally clear and well written. Figure 2 and the bullet point enumeration in 3.3 are concise presentations of the core claims of the analysis. 3. The experimental design of the empirical verification sect
1. The clarity of Sec 3.1 when discussing the causal masking and teacher-forced production of the full next-token logits set could be improved, though this is admittedly tricky. It is possible that for some readers, a diagram of the matrix structure here could be helpful since most papers don't tackle the more complex formulation of the influence problem and so readers may not be clear on it. (The reviewer imagines a teacher-forced causally masked model forward on input and label sequences X,Y
1) The paper has strong theoretical and experimental backing for its analysis. 2) It is the first paper to propose a framework extending learning dynamics to LLMs.
1) Did not find any specific weaknesses.
The paper is overall well written and clearly explains the motivation and the approach. The use of simple "pedagogical" examples also helps in communicating the main message of the paper. The main contribution seems to be in the application of the theory to the different loss functions used in LLM finetuning, making explicit use of the "decomposition" of the effective learning dynamics into 3 different terms, and identifying which one directly depends on the loss function being used. This is demonstr
There are two major limitations here that, while (somewhat) acknowledged by the authors, would benefit from a more careful discussion and/or analysis. 1. The first is the obvious limitation that the entire analysis is being done under the assumption that the "feature map" is held fixed, and only the classification layer is changing during learning, but the extent to which this assumption holds in practice is not being evaluated at all. Without such an evaluation (for example, either by quanti
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Fuzzy Logic and Control Systems · Natural Language Processing Techniques
Methods: Direct Preference Optimization
