CLOVER: Cross-Layer Orthogonal Vectors Pruning and Fine-Tuning

Fanxu Meng; Pingzhi Tang; Fan Jiang; Muhan Zhang

arXiv:2411.17426·cs.LG·February 3, 2025

TL;DR

CLOVER treats the Q-K and V-O pairs within each attention head as low-rank decompositions and applies SVD to them. The resulting singular values either guide pruning or serve as trainable parameters for fine-tuning, and are merged back into the model without increasing its parameter count.

Contribution

The paper presents CLOVER, a method that applies SVD to the attention layers for pruning and fine-tuning. It is applied to GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision, and is reported to outperform existing techniques.

Findings

01. Pruning 70% with CLOVER yields perplexity similar to pruning only 8% with vanilla methods.

02. Fine-tuning the singular values improves performance beyond state-of-the-art methods.

03. CLOVER improves pruning efficiency and model adaptability across a range of large models.

Abstract

Decoder-only models generate tokens autoregressively by caching key/value vectors, but as the cache grows, inference becomes memory-bound. To address this issue, we introduce CLOVER (Cross-Layer Orthogonal Vectors), a novel approach that treats pairs of attention layers as a set of low-rank decompositions. CLOVER applies Singular Value Decomposition (SVD) to the \( Q \)-\( K \) and \( V \)-\( O \) pairs within each attention head. The resulting singular values can either guide pruning or serve as trainable parameters for efficient fine-tuning of all orthogonal vectors. After pruning or fine-tuning, these values are reintegrated into the model without increasing its parameter count. We apply CLOVER to various models, including GPT-2 XL, DeepSeek-V2-Lite, Whisper-Large-v3, Stable Diffusion XL, and LLaMA-3.2-11B-Vision. Our results demonstrate that CLOVER significantly improves pruning…
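The decomposition described in the abstract can be illustrated with a minimal NumPy sketch for a single head's \( Q \)-\( K \) pair. This is not the authors' implementation: the function name `clover_qk_head`, the `keep` parameter, and the toy dimensions are hypothetical, and the sketch assumes no position-dependent transform sits between the two projections. It takes the SVD of the product \( W_Q W_K^\top \), keeps the largest singular values, and folds their square roots back into both factors so that no extra parameters remain.

```python
import numpy as np

def clover_qk_head(W_q, W_k, keep):
    """Orthogonalize one head's Q-K pair and keep the top `keep` directions.

    W_q, W_k: (d_model, d_head) projection matrices of a single head.
    Returns (W_q_new, W_k_new, S), where W_q_new and W_k_new have shape
    (d_model, keep) and S holds the retained singular values.
    """
    # The attention logit x_i W_q W_k^T x_j^T depends only on this product,
    # whose rank is at most d_head.
    product = W_q @ W_k.T
    U, S, Vt = np.linalg.svd(product, full_matrices=False)

    # Singular values are sorted in descending order, so truncating keeps
    # the most important orthogonal directions (the pruning criterion here).
    U, S, Vt = U[:, :keep], S[:keep], Vt[:keep, :]

    # Reintegrate the singular values by splitting them across both factors.
    root = np.sqrt(S)
    return U * root, Vt.T * root, S

# Toy example with hypothetical sizes.
rng = np.random.default_rng(0)
d_model, d_head = 16, 4
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

# Keeping all d_head directions reproduces the original product.
W_q_full, W_k_full, S_full = clover_qk_head(W_q, W_k, keep=d_head)
print(np.allclose(W_q_full @ W_k_full.T, W_q @ W_k.T))

# Keeping fewer directions prunes the head to a smaller inner dimension.
W_q_small, W_k_small, S_small = clover_qk_head(W_q, W_k, keep=2)
print(W_q_small.shape, W_k_small.shape)
```

Under the same assumptions, the \( V \)-\( O \) pair could be handled analogously by decomposing the product of the value and output projections. For fine-tuning, the vector `S` would be the trainable quantity before being merged back into the factors.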

Code & Models

Repositories

graphpku/pissa (PyTorch, official)

Datasets

fxmeng/commonsense_filtered (dataset, 64 downloads)

Taxonomy

Topics: AI-based Problem Solving and Planning · Rough Sets and Fuzzy Logic

Methods: Attention Is All You Need · Cosine Annealing · Adam · Softmax · Dropout · Linear Warmup With Cosine Annealing · Attention Dropout · Linear Layer · Byte Pair Encoding · Dense Connections