Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences

Nikolaos Dimitriadis; Pascal Frossard; Francois Fleuret

arXiv:2407.08056·cs.LG·February 27, 2025


Open Access 3 Reviews

TL;DR

This paper introduces PaLoRA, a parameter-efficient multi-task learning method that improves scalability, convergence speed, and memory efficiency by using task-specific low-rank adapters and a deterministic preference sampling schedule.

Contribution

PaLoRA combines low-rank adapters with a convex hull parameterization of the Pareto front and a deterministic preference sampling schedule to enhance multi-task learning efficiency and scalability.

Findings

1. Outperforms state-of-the-art MTL and PFL baselines.

2. Reduces memory overhead by 23.8-31.7 times.

3. Scales effectively to large networks.

Abstract

Multi-task trade-offs in machine learning can be addressed via Pareto Front Learning (PFL) methods that parameterize the Pareto Front (PF) with a single model. PFL allows the desired operating point to be selected at inference time, in contrast to traditional Multi-Task Learning (MTL), which optimizes for a single trade-off decided prior to training. However, recent PFL methodologies suffer from limited scalability, slow convergence, and excessive memory requirements, while exhibiting inconsistent mappings from preference to objective space. We introduce PaLoRA, a novel parameter-efficient method that addresses these limitations in two ways. First, we augment any neural network architecture with task-specific low-rank adapters and continuously parameterize the PF in their convex hull. Our approach steers the original model and the adapters towards learning general and task-specific features,…
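The convex-hull parameterization described in the abstract can be illustrated with a minimal sketch. It assumes that each task t contributes a LoRA-style low-rank update B_t A_t to a shared weight matrix W0, and that a preference vector lam on the probability simplex mixes these updates. The function name `merged_weight`, the `scale` argument, and the exact merge rule are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np


def merged_weight(W0, adapters, lam, scale=1.0):
    """Return W0 + scale * sum_t lam[t] * (B_t @ A_t).

    Hypothetical helper, not the authors' implementation.
    W0:       shared weight matrix, shape (out, in)
    adapters: list of (B_t, A_t) pairs, B_t of shape (out, r), A_t of shape (r, in)
    lam:      preference vector, non-negative entries that sum to 1
    """
    lam = np.asarray(lam, dtype=float)
    if lam.shape != (len(adapters),):
        raise ValueError("need exactly one preference weight per adapter")
    if np.any(lam < 0) or not np.isclose(lam.sum(), 1.0):
        raise ValueError("lam must lie on the probability simplex")
    # Convex combination of the task-specific low-rank updates.
    delta = sum(w * (B @ A) for w, (B, A) in zip(lam, adapters))
    return W0 + scale * delta
```

Under these assumptions, a one-hot preference such as `[1.0, 0.0]` recovers the weights specialized for the first task, while intermediate preferences interpolate between the task-specific updates. This is how a single set of stored parameters could expose a continuum of trade-offs at inference time.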

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 3

Strengths

1. This paper is well-organized and clearly written.
2. The idea of decoupling a neural network into a general feature extractor and several task-specific low-rank adapters is reasonable.

Weaknesses

1. The key equation (2) is quite similar to LoRA (Low-Rank Adaptation), and many multi-LoRA methods have been proposed for multi-task learning. However, the authors have overlooked them. Please explain the benefits of the proposed method compared with the multi-LoRA methods (e.g., [1-4]).
2. Fig. 2 is difficult to understand. First, how does $\lambda$ change over time? Second, how should the quality of the mappings from preference to objective space be judged?
3. For Fig.3, why are there fewer or even no compared

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The proposed method uses low-rank adapters to encode weight updates in multi-task learning, and is more efficient than previous work. Specifically, it achieves a reduction in memory usage of over $20\times$. This is particularly impactful in real-world applications, where large neural networks are typically memory-intensive and costly to scale.
2. The proposed deterministic sampling of preference, i.e., $\lambda$, seems like a notable improvement over the existing random sampling strategy

Weaknesses

1. Being able to construct a Pareto Front of models is in itself an interesting result. But what is the practical value of this? The objective of multi-task learning is to produce one model that performs well on two or more tasks. By this definition, it seems sufficient to just have one model, or rather one point on the Pareto Front. I think it will make the paper much stronger if some practicality can be demonstrated, besides the nice theoretical result. Perhaps the Pareto Front can help in OOD

Reviewer 03 · Rating 5 · Confidence 3

Strengths

1. This paper proposes to address multi-task learning with multiple low-rank adapters equipped with Pareto Front Learning.

Weaknesses

1. It lacks a discussion of how PaLoRA relates to MoE-like methods.
2. It lacks a discussion of how PaLoRA relates to general adaptive weight learning methods. How is PaLoRA better and why?
3. It seems that this work applies PFL to multiple LoRA adapters. Are there any significant differences from general PFL applications?
4. Are there comparisons of GFLOPS/Memory/Speed with other methods at the inference stage?
5. What is "Functional Diversity" in Sec 4.2? Are there any definitions or quantitative evaluations?
6. Any comparisons between P

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification · Data Stream Mining Techniques