DePT: Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

Zhengxiang Shi; Aldo Lipani

arXiv:2309.05173 · cs.CL · February 20, 2024 · 1 citation

TL;DR

DePT introduces a decomposed prompt tuning method that reduces memory and time costs while improving performance in parameter-efficient fine-tuning for large language and vision-language models.

Contribution

DePT decomposes soft prompts into shorter prompts and low-rank matrices, enhancing efficiency without increasing trainable parameters.
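The "without increasing trainable parameters" claim can be checked with simple arithmetic. The sketch below is illustrative only and rests on assumptions not stated on this page: that the low-rank pair has shapes (seq_len x rank) and (rank x dim), and that the sizes used (prompt length 100, shortened to 40, sequence length 256, hidden size 768) are representative. The function names are hypothetical.

```python
def dept_rank_budget(full_len: int, short_len: int, seq_len: int, dim: int) -> int:
    """Largest rank r such that a shorter prompt plus a low-rank pair
    fits inside the parameter budget of the original soft prompt.

    Assumed shapes (not given in the text above):
      original prompt: full_len x dim
      shorter prompt:  short_len x dim
      low-rank pair:   (seq_len x r) and (r x dim)
    """
    if not 0 <= short_len <= full_len:
        raise ValueError("short_len must lie between 0 and full_len")
    freed = (full_len - short_len) * dim  # parameters freed by shortening
    return freed // (seq_len + dim)       # each unit of rank costs seq_len + dim


def trainable_params(short_len: int, rank: int, seq_len: int, dim: int) -> int:
    """Trainable parameters of the decomposed form under the same assumptions."""
    return short_len * dim + rank * (seq_len + dim)


# Illustrative sizes only; these are assumptions, not values from this page.
FULL_LEN, SHORT_LEN, SEQ_LEN, DIM = 100, 40, 256, 768

rank = dept_rank_budget(FULL_LEN, SHORT_LEN, SEQ_LEN, DIM)
baseline = FULL_LEN * DIM
decomposed = trainable_params(SHORT_LEN, rank, SEQ_LEN, DIM)
print(rank, baseline, decomposed)  # 45 76800 76800
```

With these sizes the two budgets match exactly; for other sizes the floor division can leave the decomposed form slightly under budget.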

Findings

01

Outperforms state-of-the-art PEFT methods on 23 NLP and VL tasks.

02

More efficient as model size increases.

03

Seamlessly integrates with few-shot learning and various architectures.

Abstract

Prompt tuning (PT), where a small number of trainable soft (continuous) prompt vectors are affixed to the input of language models (LMs), has shown promising results across various tasks and models for parameter-efficient fine-tuning (PEFT). PT stands out from other PEFT approaches because it maintains competitive performance with fewer trainable parameters, and its parameter count does not grow drastically as the model size expands. However, PT introduces additional soft prompt tokens, leading to longer input sequences, which significantly increases training and inference time and memory usage because of the Transformer's quadratic complexity. This is particularly concerning for Large Language Models (LLMs) that face heavy daily querying. To address this issue, we propose Decomposed Prompt Tuning (DePT), which decomposes the soft prompt into a shorter soft prompt and a pair of low-rank matrices that are…
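To make the quadratic-complexity argument concrete, here is a rough back-of-the-envelope sketch. It compares only the self-attention score term, which scales with the square of the total input length, for a long versus a shortened soft prompt. The lengths used (256 input tokens, prompts of 100 and 40 tokens) and the function name are assumptions for illustration; real savings also depend on the linear-cost layers, so this ratio should not be read as a measured speedup.

```python
def attention_cost_ratio(seq_len: int, full_len: int, short_len: int) -> float:
    """Ratio of quadratic attention cost: shortened prompt vs. full prompt.

    Only the O(n^2) attention-score term is modeled; feed-forward and
    projection layers (linear in n) are deliberately ignored.
    """
    long_total = seq_len + full_len    # input tokens + full soft prompt
    short_total = seq_len + short_len  # input tokens + shortened soft prompt
    return (short_total / long_total) ** 2


# Illustrative lengths only; these are assumptions, not values from this page.
ratio = attention_cost_ratio(seq_len=256, full_len=100, short_len=40)
print(f"attention-score cost relative to the full prompt: {ratio:.3f}")
```

The ratio falls further as the prompt is shortened, which is the intuition behind trading prompt tokens for low-rank parameters.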

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

S1: The idea is very simple and leads to decent improvements over the baseline methods. Also, the paper is very easy to read and understand.

S2: The experiments are thorough enough; however, I have some mild additional suggestions that might make the experimental section more complete.

Weaknesses

W1: Some important baseline methods, such as IA3, are missing. See questions below.

W2: The idea is interesting; however, more intuition on why it works might strengthen the paper.

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

To the best of the reviewer’s knowledge, the method proposed in this paper, DePT, is novel. The authors also provide solid intuitions and reasoning for this method. Besides the method construction, the experiments are comprehensive. I also appreciate the authors’ efforts in organizing the anonymous project code that covers the experiments.

Weaknesses

The key contribution of DePT is that it optimizes both a soft context and the vocabulary in an efficient manner. The decomposition idea, although novel in its current form, is incremental to current PEFT methods. There is also existing work (e.g. [1]) that explored tuning subsets of vocabularies as a way of PEFT. That being said, DePT still has the advantage of efficient vocabulary tuning. The 20% efficiency advantage is also shown for only one soft prompt length, 100. It w

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

**Paper quality.** The paper is well written. The organization of the paper is clear and well thought out. I enjoyed reading the paper. **Extensive experiments.** The paper reports extensive experiments showing improved results over prompt tuning while being more efficient during training and inference. The authors have done a great job comparing the work with other recent methods in the literature and show that DePT outperforms them on GLUE and SuperGLUE. They further provide evidence of their

Weaknesses

**The architecture is not well motivated.** The architecture appears to be a combination of prompt tuning and LoRA. However, unlike LoRA, DePT still incurs the cost of the prompt length at inference time. While DePT, like LoRA, can achieve the same inference speed as the base model when the prompt length is 0, Figure 3 shows that performance in that setting is about 20 points below the DePT performance reported in Table 1. Furthermore, decomposing the prompts does not offer any concep

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications