Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization
Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo, Songyou Peng, Yandong Wen, Michael J. Black, Adrian Weller, Bernhard Schölkopf

TL;DR
This paper introduces BOFT, a parameter-efficient orthogonal finetuning method using butterfly structures, enabling effective adaptation of large models to downstream tasks with fewer trainable parameters.
Contribution
The paper proposes a novel butterfly-based orthogonal parameterization for finetuning, subsuming OFT, and demonstrates its effectiveness across vision, language, and diffusion models.
Findings
BOFT needs significantly fewer trainable parameters than OFT.
BOFT achieves performance comparable to or better than existing finetuning methods.
Experiments across vision, language, and diffusion models support BOFT's efficiency and versatility.
Abstract
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a…
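The butterfly idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the dimension `d` is a power of two, uses 2x2 rotation blocks parameterized by angles (a simplification chosen here for brevity), and the names `butterfly_factor` and `butterfly_orthogonal` are hypothetical. Each factor is a sparse orthogonal matrix that mixes index pairs at a fixed stride, as in the Cooley-Tukey FFT, and the product of log2(d) such factors is a dense orthogonal matrix that multiplies the pretrained weight.

```python
import numpy as np


def butterfly_factor(thetas, stride):
    """Sparse orthogonal factor: one 2x2 rotation per index pair (i, i + stride)."""
    d = 2 * len(thetas)
    f = np.zeros((d, d))
    k = 0
    for start in range(0, d, 2 * stride):
        for offset in range(stride):
            i = start + offset
            j = i + stride
            c, s = np.cos(thetas[k]), np.sin(thetas[k])
            f[i, i], f[i, j] = c, -s
            f[j, i], f[j, j] = s, c
            k += 1
    return f


def butterfly_orthogonal(params):
    """Multiply the factors; row `level` of `params` holds the angles for stride 2**level."""
    num_factors, half = params.shape
    r = np.eye(2 * half)
    for level in range(num_factors):
        r = butterfly_factor(params[level], 2 ** level) @ r
    return r


d = 8                                   # assumed to be a power of two
m = int(np.log2(d))                     # number of butterfly factors
rng = np.random.default_rng(0)

params = rng.normal(size=(m, d // 2))   # trainable angles (illustrative values)
R = butterfly_orthogonal(params)

W0 = rng.normal(size=(d, d))            # stand-in for a frozen pretrained weight
W = R @ W0                              # multiplicative (orthogonal) finetuning update

butterfly_param_count = params.size     # (d / 2) * log2(d)
full_param_count = d * (d - 1) // 2     # degrees of freedom of a full d x d rotation

# Zero angles give the identity, so finetuning starts from the pretrained weight.
R_init = butterfly_orthogonal(np.zeros((m, d // 2)))

# A single factor (stride 1) is block-diagonal with 2x2 blocks.
R_single = butterfly_orthogonal(params[:1])
```

In this sketch, `R` is orthogonal and has no zero entries even though every factor is sparse, while using 12 angles instead of the 28 degrees of freedom of a full 8x8 rotation. With a single factor the matrix is block-diagonal, which is one way to read the statement that the butterfly parameterization subsumes OFT.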
Peer Reviews
Decision: ICLR 2024 poster
The paper provides a good review of other methods employing butterfly parameterisations in the literature and the relative benefits such a parameterisation provides. The description of the parameterisation in Section 3 is thorough and well paced. This appears to be an original and valuable way to apply the parameterisation in deep learning. Expressivity analysis in Section 5 is a valuable contribution, identifying an advantage over OFT and making a strong argument for better performance over Lo
The information transmission framework listed as the first contribution of the paper is described as novel but I would argue this is connectivism. A similar figure can be found in every book on deep learning, for example opening Goodfellow's Deep Learning book there is a figure on page 170 that describes a bipartite connectivist diagram like this. It's possible I am missing something though. OFT is described well, but it appears too late. It is mentioned early in the paper but a q
The paper for the first time proposes BOFT, a PEFT method inspired by orthogonal fine-tuning and the Cooley-Tukey algorithm, and shows the performance of BOFT on various applications from computer vision to natural language processing.
The ability to switch different tasks efficiently would be lost in BOFT due to the fact that BOFT is based on multiplication whereas LoRA is based on addition. For the MMLU dataset, the performance of Llama 2 13B and/or Llama 2 70B should be given, because PEFT methods are designed for fine-tuning large language models. The performance comparison of Llama 2 7B seems not to be enough. The ablation study of $b$ and $m$ of BOFT seems to be necessary because all different $b$ and $m$ of BOFT a
Firstly, I appreciate the work including a good amount of experiments to show the effectiveness of BOFT. The paper is well written and the appendix provides significant experimental details. Application to SAM is interesting and useful. The motivation behind Orthogonal Butterfly technique is nicely explained.
One major concern I have is, in comparison with LoRA, I can immediately see the benefit due to a reduction in #Params but with OFT with comparable params, I see a marginal performance gain. I am not sure if OFT will be able to outperform the proposed method with some hyperparameter fine-tuning. OFT-SAM is missing in Table 5, and I would recommend authors to add the results for completion. I also have some novelty concerns with the work considering OFT, since the proposed method doesn't provide a
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Advanced Neural Network Applications
Methods: Diffusion
