LoRA ensembles for large language model fine-tuning
Xi Wang, Laurence Aitchison, Maja Rudolph

TL;DR
This paper introduces LoRA ensembles, a parameter-efficient method for fine-tuning large language models that improves uncertainty estimation and accuracy without significantly increasing computational costs.
Contribution
The paper proposes using Low-Rank Adapters (LoRA) to create large, memory-efficient ensembles for LLM fine-tuning, enhancing uncertainty quantification and predictive performance.
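The idea can be illustrated with a minimal sketch. It assumes a single frozen weight matrix shared by all ensemble members, one low-rank pair (A, B) per member, and predictions combined by averaging class probabilities, as is standard for deep ensembles. The names `make_adapter`, `lora_logits`, and `ensemble_predict` are hypothetical, and the random adapter values only stand in for trained ones; this is not the authors' implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def make_adapter(rng, d_in, d_out, rank):
    """Hypothetical stand-in for a trained adapter: A is rank x d_in, B is d_out x rank."""
    A = [[rng.gauss(0, 0.1) for _ in range(d_in)] for _ in range(rank)]
    B = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(d_out)]
    return A, B


def lora_logits(W, adapter, x, scale=1.0):
    """Compute W x + scale * B (A x); the base weights W are never modified."""
    A, B = adapter
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def ensemble_predict(W, adapters, x):
    """Average the class probabilities of all members (assumed combination rule)."""
    member_probs = [softmax(lora_logits(W, adapter, x)) for adapter in adapters]
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[k] for p in member_probs) / n_members for k in range(n_classes)]


rng = random.Random(0)
d_in, n_classes, rank, n_members = 8, 4, 2, 5  # toy sizes, chosen for illustration
W = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(n_classes)]  # shared, frozen
adapters = [make_adapter(rng, d_in, n_classes, rank) for _ in range(n_members)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

probs = ensemble_predict(W, adapters, x)
print(probs)
```

Only the small A and B matrices differ between members, so adding a member costs `rank * (d_in + d_out)` extra values rather than a full copy of `W`.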
Findings
LoRA ensembles improve calibration and accuracy.
They require minimal additional parameters.
Ensembles outperform single models in uncertainty estimation.
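Calibration claims like these are commonly quantified with the expected calibration error (ECE). The sketch below assumes equal-width confidence bins and binary correctness labels; the summary above does not state which metric or binning scheme the paper reports, so this is a generic illustration with a hypothetical function name.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |average confidence - accuracy| over bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: an overconfident model that claims 90% confidence but is right 75% of the time.
ece_demo = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(ece_demo)
```

A lower ECE means the predicted confidences track the observed accuracy more closely; a perfectly calibrated model scores 0.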
Abstract
Finetuned LLMs often exhibit poor uncertainty quantification, manifesting as overconfidence, poor calibration, and unreliable prediction results on test data or out-of-distribution samples. One approach commonly used in vision for alleviating this issue is a deep ensemble, which constructs an ensemble by training the same model multiple times using different random initializations. However, there is a huge challenge to ensembling LLMs: the most effective LLMs are very, very large. Keeping a single LLM in memory is already challenging enough: keeping an ensemble of e.g. 5 LLMs in memory is impossible in many settings. To address these issues, we propose an ensemble approach using Low-Rank Adapters (LoRA), a parameter-efficient fine-tuning technique. Critically, these low-rank adapters represent a very small number of parameters, orders of magnitude less than the underlying pre-trained…
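The memory argument can be made concrete with a short calculation. For a weight matrix of shape d_out x d_in, a rank-r adapter adds r * (d_in + d_out) parameters. The layer size, rank, and ensemble size below are hypothetical values chosen for illustration, not figures from the paper.

```python
def lora_params(d_out, d_in, rank):
    """Parameters in one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_in + d_out)


# Hypothetical sizes for a single square projection matrix.
d_out = d_in = 4096
rank = 8
members = 5

full_matrix = d_out * d_in                     # 16,777,216 parameters
per_adapter = lora_params(d_out, d_in, rank)   # 65,536 parameters
lora_ensemble_extra = members * per_adapter    # 327,680 parameters
full_ensemble = members * full_matrix          # 83,886,080 parameters

print(f"one adapter is {full_matrix // per_adapter}x smaller than the matrix")
print(f"5 adapters add {lora_ensemble_extra / full_matrix:.2%} to one copy of the matrix")
```

Under these assumptions, an ensemble of five adapters adds about 2% on top of a single copy of the matrix, whereas five fully fine-tuned copies would need five times the original storage.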
Peer Reviews
Decision · Submitted to ICLR 2024
- The introduction of LoRA for ensembling is a unique approach, particularly for large models like LLMs. This could be a useful exploration of how to ensemble such massive models.
- LoRA ensembles, whether used independently or in conjunction with other techniques, demonstrate enhancements in both prediction accuracy and the quantification of uncertainty.
- The observation that regularization may benefit calibration over just the improvement from ensembling can hold practical value.

- The paper lacks a comprehensive survey of existing ensemble methods, and it does not adequately discuss or compare with related works such as [1,2,3,4,6] in the literature.
- The focus of the paper is only on prediction ensembles, which neglects the important weight ensemble methods [1,3,4,6]. The paper argues that maintaining an ensemble of, for instance, 5 LLMs in memory can be challenging in certain scenarios. However, it's worth noting that weight ensembles require the maintenance of just

(1) The paper is well-written and easy to understand.
(2) This paper analyzes the potential of LoRA ensembles combined with various techniques, such as regularizers, dropout, and weight decay.
(3) The ablation study of LoRA with randomness is interesting.

(1) One major concern is that the idea is very naive and straightforward. The performance improvement of deep ensembles is already well known to the community, and it is in no way surprising that we can combine LoRA with ensembling to improve performance, uncertainty, etc.
(2) Another concern is that no computational costs are reported in this paper. I understand the inference costs of a LoRA ensemble are much lower than those of a traditional fine-tuning ensemble, but it would be good to demonstrate this.
(3) The ti

- The method is built on top of well-known results that model ensembles can lead to more accurate and calibrated predictions.
- LoRA ensemble alleviates the need to finetune and update the entire model, which is computationally prohibitive.
- Experiment results show that LoRA ensemble does lead to more accurate results as well as reduced calibration error.

- It is rather straightforward to consider LoRA finetuning for ensembling. The technical novelty of the proposed method is a bit limited.
- There are several relevant and stronger baselines not considered in the experiments, including calibration for in-context learning [1] and self-consistency [2], both of which show decent improvements in prediction accuracy.
- The current set of experiments considered is limited to multiple choice questions (predicting only a single token). While the method
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques
