A Comparative Analysis of LLM Adaptation: SFT, LoRA, and ICL in Data-Scarce Scenarios
Bernd Bohnet, Rumen Dangovski, Kevin Swersky, Sherry Moore, Arslan Chaudhry, Kathleen Kenealy, Noah Fiedel

TL;DR
This paper compares supervised fine-tuning (SFT), Low-Rank Adaptation (LoRA), and in-context learning (ICL) for adapting Large Language Models in data-scarce scenarios, and finds LoRA to be the most balanced approach across skill learning and knowledge retention.
Contribution
It provides a comprehensive comparison of three adaptation techniques, clarifying their trade-offs and offering guidance for selecting the best method based on task requirements.
Findings
LoRA balances skill acquisition and knowledge preservation effectively.
SFT excels at skill learning but causes catastrophic forgetting.
ICL is effective for incorporating factual knowledge but struggles with complex skills.
Abstract
The remarkable capabilities of Large Language Models (LLMs) often need to be tailored for specific applications, requiring the integration of new knowledge or the acquisition of new skills. While full fine-tuning is a powerful adaptation method, it is computationally expensive and can lead to a degradation of general reasoning abilities, a phenomenon known as catastrophic forgetting. A range of alternative techniques exists, each with its own trade-offs. In-Context Learning (ICL) is fast but limited by context length, while Parameter-Efficient Fine-Tuning (PEFT) methods like Low-Rank Adaptation (LoRA) offer a middle ground by minimizing parameter changes. However, the challenge of catastrophic forgetting persists, raising questions about the best adaptation strategy for a given task. This paper presents a comparative analysis of Supervised Finetuning (SFT), LoRA, and ICL in data-scarce…
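To make the "minimizing parameter changes" point concrete, here is a minimal sketch of the standard LoRA update, W + (alpha / r) · B·A, in plain Python. It is not taken from the paper: the layer sizes, rank, `alpha`, and the helper names `matmul`, `lora_effective_weight`, and `trainable_params` are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A.

    w: frozen base weight, shape (d_out, d_in)
    b: trainable factor,   shape (d_out, r)
    a: trainable factor,   shape (r, d_in)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)  # low-rank update, shape (d_out, d_in)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def trainable_params(d_out, d_in, r):
    """Trainable parameters for one weight matrix: full fine-tuning vs. LoRA."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# Hypothetical toy dimensions, chosen only for illustration.
rng = random.Random(0)
d_out, d_in, r = 6, 4, 2
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]
b = [[0.0] * r for _ in range(d_out)]  # B starts at zero, so the adapter is a no-op

w_eff = lora_effective_weight(w, a, b, alpha=4.0)
counts = trainable_params(4096, 4096, 8)
```

With `B` initialized to zero, the adapted layer starts out identical to the base model, and only the two small factors are trained. For a hypothetical 4096 × 4096 matrix at rank 8, that is 65,536 trainable parameters instead of 16,777,216.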
Peer Reviews
Decision · Submitted to ICLR 2026
* The presentation and organization of the paper are clear and easy to follow.
* The results effectively show the strengths and weaknesses of the LLM adaptation methods on different types of tasks.
* The main weakness of the paper is the lack of comprehensive experiments. The experiments only consider a single model, Gemma 3, which substantially limits the generalizability of the findings. For example, the authors observe a severe performance degradation on the NQ task. Would a stronger model exhibit a similar trend? Experiments conducted on a single model make it difficult to draw convincing conclusions.
* The authors should consider including more recent and advanced variants for each me
This paper systematically investigates the forgetting problem across major LLM adaptation methods, offering extensive empirical evidence to support its conclusions. Through rigorous experiments and detailed comparisons, it provides a clear and data-driven understanding of how different techniques—SFT, LoRA, and ICL—balance learning new skills and retaining prior knowledge. The abundance of quantitative results strengthens the paper’s claims and establishes a solid empirical foundation for future
1. The research does not present substantially new findings beyond what is already known about catastrophic forgetting in large language models. While the empirical comparisons are thorough, the paper does not clearly explain how its results extend or challenge existing understanding from prior studies. [1,2]
2. The paper does not specify which Gemma 3 model variant was used, even though multiple versions exist (e.g., 2B, 9B, 27B). This omission makes it difficult to reproduce or contextualize
The paper is well-written and clearly structured, making it easy to follow. The comparative setup (SFT vs. LoRA vs. ICL) is useful, especially given the current interest in efficient model adaptation. The experiments are broad (13 benchmarks across skill and knowledge tasks) and highlight consistent empirical trends.
* Lack of novelty. This paper summarizes known trade-offs (e.g., SFT forgets more, LoRA forgets less) without introducing new analytical insights or theoretical framing.
* The experiments cover several benchmarks, but they stay pretty descriptive. It would be nice to see more diagnostic analysis, like why LoRA fails in very low-data cases or how rank affects stability.
* The ICL results are somewhat superficial. The authors only test few-shot performance without exploring prompt design, ordering
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
