Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Yingyu Liang; Jiangxuan Long; Zhenmei Shi; Zhao Song; Yufa Zhou

arXiv:2410.11261 · cs.LG · February 27, 2025 · 3 cites

TL;DR

This paper presents a new non-linear pruning method for attention matrices in LLMs that improves efficiency and maintains performance, addressing the limitations of linear approximation methods.

Contribution

It introduces a gradient descent-based pruning approach that directly optimizes attention matrix approximation, with theoretical guarantees and superior empirical results.

Findings

01 Significant reduction in computational costs compared to state-of-the-art methods

02 Maintains model performance after pruning

03 Provides theoretical convergence guarantees for the pruning method

Abstract

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining…
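The idea in the abstract, choosing a pruning mask by gradient descent so that the Softmax attention matrix (rather than a linear product) is preserved, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the combined weight matrix `W`, the relaxed mask with entries in [0, 1], the L2 penalty weight `lam`, the finite-difference gradient, and every function name are hypothetical choices made for this example. The paper's actual objective and update rule may differ.

```python
import math

def matmul(a, b):
    return [[sum(u * v for u, v in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(scores):
    out = []
    for row in scores:
        top = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([v / total for v in exps])
    return out

def attention(x, w):
    # Row-wise softmax of X W X^T (assumed form of the attention matrix).
    return softmax_rows(matmul(matmul(x, w), list(zip(*x))))

def masked(w, m):
    # Entry-wise product of the weights and the mask.
    return [[wv * mv for wv, mv in zip(wr, mr)] for wr, mr in zip(w, m)]

def loss(x, w, m, lam):
    # Squared Frobenius gap between dense and masked attention, plus an
    # assumed penalty lam * ||M||_F^2 that pushes mask entries toward zero.
    dense, pruned = attention(x, w), attention(x, masked(w, m))
    err = sum((p - q) ** 2 for rp, rq in zip(dense, pruned) for p, q in zip(rp, rq))
    return err + lam * sum(v * v for row in m for v in row)

def numeric_grad(x, w, m, lam, eps=1e-5):
    # Central finite differences stand in for an analytic gradient here.
    grad = [[0.0] * len(row) for row in m]
    for i in range(len(m)):
        for j in range(len(m[0])):
            old = m[i][j]
            m[i][j] = old + eps
            up = loss(x, w, m, lam)
            m[i][j] = old - eps
            down = loss(x, w, m, lam)
            m[i][j] = old
            grad[i][j] = (up - down) / (2 * eps)
    return grad

def prune_mask(x, w, lam=0.05, lr=0.5, steps=200):
    m = [[1.0] * len(row) for row in w]  # start from the unpruned model
    history = [loss(x, w, m, lam)]
    for _ in range(steps):
        g = numeric_grad(x, w, m, lam)
        # Gradient step, then projection back onto the box [0, 1].
        cand = [[min(1.0, max(0.0, mv - lr * gv)) for mv, gv in zip(mr, gr)]
                for mr, gr in zip(m, g)]
        cand_loss = loss(x, w, cand, lam)
        if cand_loss <= history[-1]:
            m = cand
            history.append(cand_loss)
        else:
            lr *= 0.5  # reject the step and shrink the step size
    return m, history

# Toy inputs: 3 tokens, hidden size 2 (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[2.0, 0.1], [0.1, 2.0]]
mask, history = prune_mask(X, W)
print(mask)
print(history[0], history[-1])
```

Entries of the relaxed mask that end up near zero are candidates for removal. Because the loss is measured after the Softmax, a weight whose removal barely changes `X W` can still be kept if it shifts the attention distribution, which is the distinction the abstract draws with linear approximation methods.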

Videos

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix · SlidesLive

Taxonomy

Topics: Neural Networks and Applications

Methods: Attention Is All You Need · Softmax · Focus · Pruning