TL;DR
This paper introduces a parameter-efficient fine-tuning method for foundation models using a factorization of weight matrices into circulant and diagonal matrices, reducing computational complexity while maintaining or improving performance.
Contribution
It proposes factorizing the fine-tuning weight change into a product of interleaved circulant and diagonal matrices, which avoids explicitly constructing the weight change matrix and uses the 1D FFT instead of the 2D FFT.
Findings
Achieves comparable or better performance with fewer FLOPs.
Reduces trainable parameters significantly.
Effective across various tasks; the reviews cite experiments on RoBERTa and ViT models.
Abstract
Foundation models have achieved tremendous success in different domains. However, their huge computation and storage costs make these models difficult to fine-tune and less applicable in practice. A recent study shows that training in the Fourier domain can be an effective fine-tuning method in terms of both model performance and number of training parameters. In this work, we propose to further reduce the complexity by factorizing the weight change into a product of interleaved circulant and diagonal matrices. In addition, we address the case of non-square fine-tuning weights by partitioning the circulant matrix into blocks. Our method avoids the construction of the weight change matrix and utilizes the 1D fast Fourier transform (FFT) instead of the 2D FFT. Experimental results show that our method achieves similar or better performance across various tasks with far fewer floating-point operations (FLOPs)…
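As a rough illustration of the idea in the abstract, the sketch below applies a product of interleaved circulant and diagonal factors to a vector using only 1D FFTs, without forming the weight change matrix. The factor ordering (C_1 D_1 ... C_m D_m), the function names, and the sizes n = 8 and m = 2 are assumptions made for illustration, not details taken from the paper. Only the square case is covered; the block partitioning for non-square weights is not shown.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x.

    A circulant matvec is a circular convolution, so it can be done
    with 1D FFTs in O(n log n) instead of O(n^2).
    """
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

def cd_apply(circ_cols, diags, x):
    """Compute (C_1 D_1 C_2 D_2 ... C_m D_m) x factor by factor.

    The ordering is an assumption of this sketch. The rightmost factor
    acts on x first, so the factors are traversed in reverse.
    """
    y = x
    for c, d in zip(reversed(circ_cols), reversed(diags)):
        y = circulant_matvec(c, d * y)  # diagonal scaling, then circulant
    return y

def dense_circulant(c):
    """Build the dense circulant matrix (used here only for checking)."""
    n = len(c)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return c[idx]

rng = np.random.default_rng(0)
n, m = 8, 2  # hypothetical sizes chosen for the example
circ_cols = [rng.standard_normal(n) for _ in range(m)]
diags = [rng.standard_normal(n) for _ in range(m)]
x = rng.standard_normal(n)

# Reference: form the full weight change matrix explicitly.
delta_w = np.eye(n)
for c, d in zip(circ_cols, diags):
    delta_w = delta_w @ dense_circulant(c) @ np.diag(d)

fast = cd_apply(circ_cols, diags, x)
print(np.allclose(fast, delta_w @ x))
```

In this sketch the trainable quantities are the m circulant vectors and m diagonal vectors, 2mn numbers in total rather than n^2, and the dense matrix is built only to check the FFT-based result.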
Peer Reviews
Decision·Submitted to ICLR 2025
1. CDVFT is more efficient than LoRA and FourierFT by using matrix factorization and 1D FFT to effectively reduce the number of parameters and FLOPs.
2. CDVFT exploits circulant and diagonal matrix product properties to provide computational advantages, thereby significantly reducing FLOPs and complexity.
3. Extensive testing on RoBERTa and ViT models confirms that the proposed strategy performs as well or better than baseline techniques, and the results show that it requires less computationa

1. Although the study achieved impressive empirical results, there is a lack of thorough theoretical analysis and evidence to confirm the advantages of the circulant-diagonal factorization method.
2. The reliance of the method on FFT-based operations may limit its generality as it is only applicable to LFMs with specific architectural characteristics.
3. The study did not investigate various hyperparameter combinations beyond the fixed setting, which may affect the robustness of the CDVFT resu

1. The idea of using the product of interleaved circulant and diagonal matrices to represent the weight updates is interesting.
2. The theoretical analysis of computational complexity is good to illustrate the benefit of CDVFT in terms of FLOPs.
3. The technical details are sufficient and friendly to reproduce.

1. The experiments regarding NLP tasks are insufficient. Authors should consider experiments of fine-tuning the latest models such as Llama-3-8B. Moreover, authors should consider the evaluation of CDVFT on the generation tasks.
2. The improvement of computation complexity may not indicate the reduction of end-to-end fine-tuning costs. The factorization can bring exorbitant costs on the kernel launch which might offset the benefit of the computation complexity improvement.
3. The evaluation does

- This paper draws inspiration from matrix decomposition using diagonal and circulant matrices, and integrates the existing idea into the parameter-efficient finetuning problem.
- The paper includes a detailed analysis of the number of trainable parameters and training complexity.
- Experimental results on various workloads with multiple runs validate the effectiveness of the proposed method.

- The motivation for CDVFT is not convincing. It is unclear if balancing the number of trainable parameters (which affects memory requirements) with training FLOPs (which affects training time) is necessary. In the experiments presented, the number of trainable parameters and required memory are already negligible, making further reduction seem unnecessary.
- The paper lacks a clear explanation for the selection of hyperparameters. While it claims “m=2” is sufficient, it does not address whether
