ColA: Collaborative Adaptation with Gradient Learning
Enmao Diao, Qi Le, Suya Wu, Xinran Wang, Ali Anwar, Jie Ding, Vahid Tarokh

TL;DR
ColA introduces a parameter-free, model-agnostic fine-tuning method that reduces computational costs by offloading gradient computation to low-cost devices, making Fine-Tuning as a Service (FTaaS) more efficient without sacrificing performance.
Contribution
The paper proposes ColA, a novel gradient learning approach that decouples gradient computation, enabling cost-effective fine-tuning suitable for multiple users and devices.
Findings
ColA performs on par with or better than PEFT methods on benchmarks.
ColA reduces computational costs by offloading gradient calculations.
Theoretical analysis supports ColA's effectiveness and efficiency.
Abstract
A primary function of back-propagation is to compute both the gradient of hidden representations and parameters for optimization with gradient descent. Training large models requires high computational costs due to their vast parameter sizes. While Parameter-Efficient Fine-Tuning (PEFT) methods aim to train smaller auxiliary models to save computational space, they still present computational overheads, especially in Fine-Tuning as a Service (FTaaS) for numerous users. We introduce Collaborative Adaptation (ColA) with Gradient Learning (GL), a parameter-free, model-agnostic fine-tuning approach that decouples the computation of the gradient of hidden representations and parameters. In comparison to PEFT methods, ColA facilitates more cost-effective FTaaS by offloading the computation of the gradient to low-cost devices. We also provide a theoretical analysis of ColA and experimentally…
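The decoupling described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a frozen linear layer `W0`, a linear auxiliary adapter `A`, a squared-error loss, and a surrogate target built from the stored hidden-representation gradient. All names and values (`W0`, `A`, `server_pass`, `coupled_grad`, `decoupled_grad`) are hypothetical. Under these assumptions, the parameter gradient computed later from the stored pair (input, hidden gradient) matches the one ordinary back-propagation would give.

```python
# Toy illustration (assumed setup, not the paper's code): a frozen layer W0
# plus a linear adapter A, with loss L = 0.5 * ||h - y||^2.

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

W0 = [[0.5, -1.0], [2.0, 0.25]]   # frozen base weights (hypothetical values)
A = [[0.1, 0.0], [-0.2, 0.3]]     # trainable adapter weights (hypothetical)
x = [1.0, 2.0]                    # input to the layer
y = [0.0, 1.0]                    # regression target

def server_pass(A, x, y):
    """'Server' step: forward pass and gradient w.r.t. the hidden output only."""
    delta_h = matvec(A, x)
    h = [b + d for b, d in zip(matvec(W0, x), delta_h)]
    grad_h = [hi - yi for hi, yi in zip(h, y)]  # dL/dh for the squared error
    return delta_h, grad_h

def coupled_grad(A, x, y):
    """Conventional back-propagation: dL/dA = (dL/dh) x^T."""
    _, grad_h = server_pass(A, x, y)
    return outer(grad_h, x)

def decoupled_grad(A, x, delta_h, grad_h):
    """'Low-cost device' step: gradient of 0.5 * ||A x - target||^2 w.r.t. A,
    where target = delta_h - grad_h is treated as a constant."""
    target = [d - g for d, g in zip(delta_h, grad_h)]
    residual = [p - t for p, t in zip(matvec(A, x), target)]
    return outer(residual, x)

delta_h, grad_h = server_pass(A, x, y)
g_coupled = coupled_grad(A, x, y)
g_decoupled = decoupled_grad(A, x, delta_h, grad_h)
max_diff = max(abs(a - b)
               for ra, rb in zip(g_coupled, g_decoupled)
               for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

In this sketch only `server_pass` touches the frozen weights. `decoupled_grad` needs just the stored input, adapter output, and hidden gradient, which is why it could in principle run on a separate device.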
Peer Reviews
Decision: Submitted to ICLR 2024
### Originality
* Novelty of ColA: The main innovation in ColA is the ability to compute updates to auxiliary parameters offline. Previous approaches for efficient model adaptation include: fine-tuning adapter layers, which places learnable layers in between existing learnable layers; Low-Rank Adaptation (LoRA), which introduces two low-rank matrices to parametrize the updates of pretrained weight matrices; and Prefix Tuning, which prepends a sequence of learnable tokens as input to the network. Conc
Weaknesses
* Not clear how to align the proposed method with other optimization strategies (i.e., beyond gradient descent)
* Still need to forward/backward propagate the model K times for K users; i.e., the decoupled gradient computation and adaptation does not address this issue
* To compute the change in hidden state at some layer $m < M$, you need to have first computed the hidden state at layer $m-1$. Since this computation is carried out on the GPU (server), it appears as though you either need
- The proposed method is simple and easy to implement.
- Efficient fine-tuning is an important topic.
- The motivation and advantages of moving the gradient update to the CPU are unclear.
- The relationship between the proposed gradient learning and collaborative adaptation is unclear.
- The experiments do not include collaborative adaptation.
* **Gradient Learning is Novel**: the idea of decoupling the calculation of the gradient of the model weights and the gradient of hidden features seems novel. The paper also theoretically demonstrates the equivalence between the proposed decoupled update and the conventional update rules.
* **Why is the method parameter-free?**: even though the proposed model offloads the update of adapter weights to a different device, it does not make it parameter-free. It is not very convincing to claim the fine-tuning method to be parameter-free.
* **Actual memory footprint not clear**: while the model claims to save storage on the main device, e.g., the GPU, the paper does not report the actual memory footprint during the forward and backward passes on the GPU. Compared to the small number
