TeZO: Empowering the Low-Rankness on the Temporal Dimension in the Zeroth-Order Optimization for Fine-tuning LLMs

Yan Sun; Tiansheng Huang; Liang Ding; Li Shen; Dacheng Tao

arXiv:2501.19057·cs.LG·February 3, 2025

Open Access

TL;DR

TeZO introduces a novel low-rank zeroth-order optimizer that captures gradient low-rankness across model and temporal dimensions, significantly reducing memory and computational costs in fine-tuning large language models.

Contribution

The paper proposes TeZO, a low-rank zeroth-order estimator that considers both model and temporal gradient properties, extending existing methods and improving efficiency.

Findings

01. Achieves state-of-the-art performance with lower memory usage.
02. Reduces training cost by exploiting low-rank structures.
03. Extends easily to the Adam optimizer with less memory than alternatives.

Abstract

Zeroth-order optimization (ZO) has demonstrated remarkable promise in efficient fine-tuning tasks for Large Language Models (LLMs). In particular, recent advances incorporate the low-rankness of gradients, introducing low-rank ZO estimators to further reduce GPU memory consumption. However, most existing works focus solely on the low-rankness of each individual gradient, overlooking a broader property shared by all gradients throughout training, i.e., all gradients approximately reside within a similar subspace. In this paper, we consider these two factors together and propose a novel low-rank ZO estimator, TeZO, which captures the low-rankness across both the model and temporal dimensions. Specifically, we represent ZO perturbations along the temporal dimension as a 3D tensor and employ Canonical Polyadic Decomposition (CPD) to extract each low-rank 2D matrix, significantly reducing the…
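
To make the idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes one plausible reading of the CPD structure: two factor matrices `U` and `V` are shared across all training steps, and only a length-`r` vector `tau` is resampled at each step, so the step-t perturbation is `Z_t = sum_s tau[s] * outer(U[:, s], V[:, s])`. The function names, the toy quadratic loss, the two-point estimator, and all sizes are illustrative assumptions.

```python
import numpy as np

def cpd_perturbation(U, V, tau):
    """Build one 2D slice of a rank-r CPD tensor.

    Assumed form: Z = sum_s tau[s] * outer(U[:, s], V[:, s]),
    which equals U @ diag(tau) @ V.T.
    """
    return (U * tau) @ V.T

def zo_gradient(loss, W, Z, eps=1e-3):
    """Two-point zeroth-order estimate along direction Z (forward passes only)."""
    scale = (loss(W + eps * Z) - loss(W - eps * Z)) / (2.0 * eps)
    return scale * Z

rng = np.random.default_rng(0)
m, n, r, steps = 8, 6, 2, 5          # hypothetical layer shape, rank, step count

# Factors shared across the temporal dimension (assumption of this sketch).
U = rng.standard_normal((m, r))
V = rng.standard_normal((n, r))

# Toy objective standing in for a fine-tuning loss.
target = rng.standard_normal((m, n))
def loss(W):
    return 0.5 * np.sum((W - target) ** 2)

W = np.zeros((m, n))
taus = rng.standard_normal((steps, r))   # only r fresh numbers per step
grads = []
for t in range(steps):
    Z = cpd_perturbation(U, V, taus[t])
    g = zo_gradient(loss, W, Z)
    grads.append(g)
    W = W - 0.01 * g                     # plain SGD-style update

# Flatten each estimate to a row to inspect the subspace they span over time.
stacked = np.stack(grads, axis=0).reshape(steps, -1)
```

Under these assumptions, each estimate has matrix rank at most `r`, and all estimates across steps lie in the same `r`-dimensional span, which mirrors the "similar subspace" property the abstract describes. The shared factors take `(m + n) * r` numbers plus `r` per step, compared with `m * n` for a dense perturbation.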

Taxonomy

Topics: Advanced Control Systems Design · Iterative Learning Control Systems · Model Reduction and Neural Networks

Methods: Adam · Focus