Parameter-Efficient Subspace Optimization for LLM Fine-Tuning
Yuchen Lou, Zeqi Ye, Minshuo Chen

TL;DR
This paper introduces PESO, a unifying framework for parameter-efficient fine-tuning of large language models, connecting existing methods like LoRA to subspace optimization theory and providing convergence guarantees.
Contribution
The paper proposes PESO, a new subspace optimization framework for PEFT, unifying and extending methods like LoRA with theoretical convergence guarantees.
Findings
PESO-LoRA outperforms existing PEFT methods on standard benchmarks.
Provides convergence guarantees in the full-parameter space.
Connects PEFT methods to classical subspace optimization theory.
Abstract
This paper develops a new perspective on parameter-efficient fine-tuning (PEFT) for LLMs, inspired by classical subspace minimization. We introduce a unifying framework, Parameter-Efficient Subspace Optimization (PESO), which recovers existing methods such as LoRA and connects them to the principled algorithmic and theoretical foundations of subspace optimization. This connection highlights a natural "exploration-exploitation" view of subspace methods, guiding the design of new algorithms that achieve strong convergence performance while still preserving memory efficiency. We instantiate the framework into a practical algorithm, PESO-LoRA, based on a LoRA-type parameterization. Importantly, we provide convergence guarantees stated in the full-parameter space for the induced update, addressing a key limitation of LoRA-style analyses that only track low-dimensional factors.…
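To make the subspace view in the abstract concrete, here is a minimal toy sketch in NumPy. It is not the paper's PESO-LoRA algorithm. It only illustrates one reading of the idea: take gradient steps on LoRA-type factors B and A inside a low-rank subspace (exploitation), and periodically merge B @ A into the full weight and draw a fresh subspace (exploration). The least-squares objective, the function names (`make_problem`, `loss_and_grad`, `subspace_finetune`), and all hyperparameters are assumptions made for illustration.

```python
import numpy as np


def make_problem(n=200, d_in=8, d_out=6, seed=0):
    """Toy regression task (an assumption): the target is W0 plus a full-rank shift."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d_in))
    W0 = rng.standard_normal((d_in, d_out))
    W_star = W0 + 0.5 * rng.standard_normal((d_in, d_out))
    return X, X @ W_star, W0


def loss_and_grad(W, X, Y):
    """Mean squared-error loss and its gradient with respect to the full weight W."""
    R = X @ W - Y
    n = X.shape[0]
    return 0.5 * np.sum(R ** 2) / n, X.T @ R / n


def subspace_finetune(W0, X, Y, rank=2, restart_every=40, steps=400, lr=0.05, seed=0):
    """Gradient descent on factors (B, A) of the update W + B @ A.

    Every `restart_every` steps the product B @ A is merged into W and new
    factors are drawn. Pass restart_every=None to keep a single fixed subspace.
    """
    rng = np.random.default_rng(seed)
    d_in, d_out = W0.shape

    def fresh_factors():
        # LoRA-style init: B = 0, so the effective weight is unchanged at a restart.
        B = np.zeros((d_in, rank))
        A = rng.standard_normal((rank, d_out)) / np.sqrt(d_out)
        return B, A

    W = W0.copy()
    B, A = fresh_factors()
    losses = []
    for t in range(steps):
        if restart_every and t > 0 and t % restart_every == 0:
            W = W + B @ A          # merge the learned low-rank update
            B, A = fresh_factors() # move to a new random subspace
        # The full gradient G is formed here only for clarity; a memory-efficient
        # implementation would obtain the factor gradients without storing G.
        loss, G = loss_and_grad(W + B @ A, X, Y)
        losses.append(loss)
        grad_B, grad_A = G @ A.T, B.T @ G  # chain rule through W + B @ A
        B = B - lr * grad_B
        A = A - lr * grad_A
    return W + B @ A, losses


if __name__ == "__main__":
    X, Y, W0 = make_problem()
    _, with_restarts = subspace_finetune(W0, X, Y, restart_every=40)
    _, fixed_subspace = subspace_finetune(W0, X, Y, restart_every=None)
    print("final loss with restarts:  ", with_restarts[-1])
    print("final loss, fixed subspace:", fixed_subspace[-1])
```

On this toy problem the required update is full rank, so a single rank-2 subspace cannot represent it, while the accumulated merges can. That gap is the kind of behavior a full-parameter-space analysis is meant to capture, since the quantity of interest is W + B @ A rather than the factors alone.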
Peer Reviews
Decision: Submitted to ICLR 2026
1. The PESO framework bridges PEFT with classical subspace minimization, offering an exploration–exploitation perspective and a unified Algorithm 1 that generalizes several existing methods.
2. PESO-LoRA-R and PESO-LoRA-T emerge as straightforward, practical special cases directly derived from the framework.
3. The paper presents theoretical guarantees for full-rank convergence under the stated assumptions.
4. The model is empirically evaluated through Llama-2-7B pre-training and multiple benchm
1. Since the core theme of the paper revolves around exploration-exploitation, it would be natural to include targeted ablation studies, particularly examining the effects of restart frequency (K), rank (r), and related parameters.
2. Although the paper positions itself as a unifying framework, it lacks in-depth discussion of and comparison with key baselines in this area, notably GaLore [1] and other state-of-the-art methods.
3. Please correct me if I’m mistaken, but M appears to be defined inco
- Provides a framework that can cover some existing low-rank fine-tuning approaches
- The paper is well-written in general and easy to follow
While the paper claims contributions at the conceptual, theoretical, and empirical levels, these contributions appear insufficiently substantiated.
1. **Conceptual novelty**. The subspace minimization perspective is not new. This viewpoint has already been well established in GaLore [A1] and more recently revisited in Randomized Subspace Optimization (RSO) [A2]. In particular, the proposed framework in Equation (3) closely resembles RSO, where a low-rank variable $\xi$ is obtained by solving a
The strengths of this paper are summarized as follows:
1. It combines multiple Parameter-Efficient Fine-Tuning (PEFT) methods, such as LoRA, AdaLoRA, and GaLore, under a single mathematical view.
2. Theoretically, it gives the first proof of a full-parameter convergence guarantee for a memory-efficient fine-tuning method. The convergence guarantee is in the full model weight space.
3. The proposed framework, PESO, is practical. It is a plug-and-play design and can improve existi
The weaknesses of this paper are summarized as follows:
1. The experimental results are based on T5-base and LLaMA-2-7B. It would be better if the authors included results on more models, such as LLaMA 3, and it would be more interesting to test models of different sizes.
2. The experimental results seem to focus on fine-tuning. It would be better if the authors considered full pre-training. Also, the paper primarily compares against LoRA-based baselines. It lacks
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks · VLSI and FPGA Design Techniques
