Gradient Weight-normalized Low-rank Projection for Efficient LLM Training

Jia-Hong Huang; Yixian Shen; Hongyi Zhu; Stevan Rudinac; Evangelos Kanoulas

arXiv:2412.19616 · cs.LG · January 7, 2025

TL;DR

GradNormLoRP makes training large language models more efficient by normalizing the weight matrix and applying low-rank approximations to the weight and gradient matrices, which reduces memory usage while keeping performance comparable to full fine-tuning.

Contribution

The paper introduces GradNormLoRP, a technique that improves parameter and memory efficiency in LLM training by combining weight normalization with low-rank projection; in fine-tuning tasks it outperforms existing low-rank methods.

Findings

01. Reduces optimizer memory usage by up to 89.5%.
02. Enables pre-training of large LLMs on consumer GPUs.
03. Outperforms existing low-rank methods in fine-tuning tasks.

Abstract

Large Language Models (LLMs) have shown remarkable performance across various tasks, but the escalating demands on computational resources pose significant challenges, particularly in the extensive utilization of full fine-tuning for downstream tasks. To address this, parameter-efficient fine-tuning (PEFT) methods have been developed, but they often underperform compared to full fine-tuning and struggle with memory efficiency. In this work, we introduce Gradient Weight-Normalized Low-Rank Projection (GradNormLoRP), a novel approach that enhances both parameter and memory efficiency while maintaining comparable performance to full fine-tuning. GradNormLoRP normalizes the weight matrix to improve gradient conditioning, facilitating better convergence during optimization. Additionally, it applies low-rank approximations to the weight and gradient matrices, significantly reducing memory…
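
The two ingredients named in the abstract, weight normalization and low-rank projection of the gradient, can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the column-wise normalization, the orthogonal-iteration routine used to find the projection basis, and every function name below (weight_normalize, top_left_subspace, and so on) are assumptions made for illustration only.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns."""
    return [list(col) for col in zip(*a)]


def weight_normalize(v):
    """Split V into per-column norms m and unit-norm columns, so W = m * V / ||V||.

    Column-wise normalization is an assumption of this sketch.
    """
    norms = [math.sqrt(sum(x * x for x in col)) for col in zip(*v)]
    direction = [[x / n for x, n in zip(row, norms)] for row in v]
    return norms, direction


def orthonormalize(cols):
    """Gram-Schmidt on a list of column vectors (assumed linearly independent)."""
    basis = []
    for c in cols:
        for b in basis:
            dot = sum(x * y for x, y in zip(c, b))
            c = [x - dot * y for x, y in zip(c, b)]
        n = math.sqrt(sum(x * x for x in c))
        basis.append([x / n for x in c])
    return basis


def top_left_subspace(g, rank, iters=50, seed=0):
    """Approximate the top-`rank` left singular subspace of G by orthogonal iteration.

    Returns P with shape (rows of G) x rank and orthonormal columns.
    """
    rng = random.Random(seed)
    ggt = matmul(g, transpose(g))
    cols = orthonormalize([[rng.gauss(0, 1) for _ in range(len(g))] for _ in range(rank)])
    for _ in range(iters):
        # Apply G G^T to every basis vector, then re-orthonormalize.
        cols = orthonormalize(transpose(matmul(ggt, transpose(cols))))
    return transpose(cols)


# Weight normalization on a tiny 2x2 matrix.
v = [[3.0, 0.0], [4.0, 2.0]]
norms, direction = weight_normalize(v)

# Low-rank projection of a 4x6 gradient that has exact rank 2.
rng = random.Random(1)
a = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(4)]
b = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(2)]
grad = matmul(a, b)

p = top_left_subspace(grad, rank=2)
low_rank_grad = matmul(transpose(p), grad)  # 2x6: the compact matrix an optimizer would track
restored = matmul(p, low_rank_grad)         # 4x6: projected back to the full shape
error = max(abs(x - y) for r1, r2 in zip(grad, restored) for x, y in zip(r1, r2))

print("column norms:", norms)
print("projected gradient shape:", len(low_rank_grad), "x", len(low_rank_grad[0]))
print("max reconstruction error:", error)
```

In this toy setup the optimizer would keep statistics for a 2x6 matrix instead of a 4x6 one, which is the kind of saving that low-rank gradient projection is aiming at. A practical implementation would use a tensor library with a proper SVD rather than hand-written linear algebra.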

Code & Models

Repositories

jhhuangkay/gradient-weight-normalized-low-rank-projection-for-efficient-llm-training
PyTorch · Official

Taxonomy

Topics: Optical Systems and Laser Technology · Sparse and Compressive Sensing Techniques · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Attention Dropout · Linear Layer · Softmax · Dense Connections · Linear Warmup With Linear Decay · Dropout · WordPiece · Residual Connection