TuCo: Measuring the Contribution of Fine-Tuning to Individual Responses of LLMs
Felipe Nuti, Tim Franzmeyer, João Henriques

TL;DR
This paper introduces TuCo, a method to quantify how fine-tuning influences individual responses of large language models by decomposing responses into pre-training and fine-tuning components, enabling detailed analysis of model behavior and safety.
Contribution
The paper presents a novel method for measuring the contribution of fine-tuning to individual LLM outputs using hidden state analysis and theoretical decomposition, advancing understanding of fine-tuning effects.
Findings
Model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass.
Successful adversarial attacks are associated with an attenuated fine-tuning contribution.
TuCo correlates with safety and attack success in LLMs.
Abstract
Past work has studied the effects of fine-tuning on large language models' (LLMs) overall performance on certain tasks. However, a quantitative and systematic method for analyzing its effect on individual outputs is still lacking. Here, we propose a new method for measuring the contribution that fine-tuning makes to individual LLM responses, assuming access to the original pre-trained model. Our method tracks the model's intermediate hidden states, providing a more fine-grained insight into the effects of fine-tuning than a simple comparison of final outputs from pre-trained and fine-tuned models. We introduce and theoretically analyze an exact decomposition of any fine-tuned LLM into a pre-training component and a fine-tuning component. Empirically, we find that model behavior and performance can be steered by up- or down-scaling the fine-tuning component during the forward pass.…
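To make the decomposition and scaling idea in the abstract concrete, here is a minimal toy sketch. It is not the authors' implementation: it assumes a residual architecture in which each layer adds an update to the hidden state, and it assumes the fine-tuning component at a layer is the difference between the fine-tuned and pre-trained layer outputs evaluated on the same hidden state. The names `W_pt`, `W_ft`, `layer`, `forward`, `forward_scaled`, and `alpha` are hypothetical and chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
D, L = 8, 4  # toy hidden size and number of layers

# Hypothetical toy "models": one weight matrix per layer.
W_pt = [rng.normal(scale=0.3, size=(D, D)) for _ in range(L)]
# Fine-tuned weights: pre-trained weights plus a small perturbation.
W_ft = [w + rng.normal(scale=0.05, size=(D, D)) for w in W_pt]


def layer(w, x):
    """Toy layer update: tanh of a linear map of the hidden state."""
    return np.tanh(w @ x)


def forward(weights, x):
    """Plain residual forward pass: x <- x + f_l(x) for each layer."""
    for w in weights:
        x = x + layer(w, x)
    return x


def forward_scaled(x, alpha):
    """Residual pass with the fine-tuning component scaled by alpha."""
    for w_pt, w_ft in zip(W_pt, W_ft):
        ptc = layer(w_pt, x)        # pre-training component at this layer
        ftc = layer(w_ft, x) - ptc  # fine-tuning component (assumed form)
        x = x + ptc + alpha * ftc
    return x


x0 = rng.normal(size=D)
out_ft = forward_scaled(x0, alpha=1.0)    # should match the fine-tuned model
out_pt = forward_scaled(x0, alpha=0.0)    # should match the pre-trained model
out_half = forward_scaled(x0, alpha=0.5)  # an intermediate, "attenuated" model
```

Under these assumptions, `alpha=1.0` reproduces the fine-tuned toy model and `alpha=0.0` reproduces the pre-trained one, so intermediate or larger values of `alpha` correspond to down- or up-scaling the fine-tuning component during the forward pass.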
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
