Solving Attention Kernel Regression Problem via Pre-conditioner

Zhao Song; Junze Yin; Lichen Zhang

arXiv:2308.14304·cs.LG·April 3, 2024·1 cites

TL;DR

This paper designs fast algorithms for regression problems defined against proxies of the attention matrix in large language models, using sketching and preconditioning.

Contribution

It proposes regression algorithms for two proxies of the attention matrix: the matrix exponential of $A^\top A$ and the entrywise exponential of the Gram matrix.

Findings

01. Developed algorithms for regression against matrix-exponential proxies of the attention matrix

02. Designed methods for regression against the entrywise exponential of the Gram matrix

03. Provided efficient approximation techniques for attention matrices

Abstract

The attention mechanism is the key to large language models, and the attention matrix serves as an algorithmic and computational bottleneck for such a scheme. In this paper, we define two problems, motivated by designing fast algorithms for proxy of attention matrix and solving regressions against them. Given an input matrix $A \in \mathbb{R}^{n \times d}$ with $n \gg d$ and a response vector $b$, we first consider the matrix exponential of the matrix $A^\top A$ as a proxy, and we in turn design algorithms for two types of regression problems: $\min_{x \in \mathbb{R}^d} \| (A^\top A)^j x - b \|_2$ and $\min_{x \in \mathbb{R}^d} \| A (A^\top A)^j x - b \|_2$ for any positive integer $j$. Studying algorithms for these regressions is essential, as matrix exponential can be approximated term-by-term via these smaller problems. The second proxy is applying exponential entrywise to the Gram matrix, denoted by…
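To make the first regression problem concrete, here is a minimal NumPy sketch of the general sketch-and-precondition idea applied to $\min_x \| (A^\top A)^j x - b \|_2$. It is not the paper's algorithm: it assumes $A$ has full column rank (so the minimizer is $(A^\top A)^{-j} b$), swaps in a dense Gaussian sketch, and uses a textbook preconditioned conjugate gradient loop. The function names `sketch_preconditioner`, `solve_gram`, and `solve_gram_power`, and the `oversample` parameter, are hypothetical and chosen only for illustration.

```python
import numpy as np


def sketch_preconditioner(A, oversample=8, seed=0):
    """Return a d x d factor R with R^T R roughly equal to A^T A.

    Assumption: a dense Gaussian sketch with oversample * d rows.
    """
    n, d = A.shape
    rng = np.random.default_rng(seed)
    m = oversample * d
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    _, R = np.linalg.qr(S @ A)  # reduced QR of the small m x d matrix S A
    return R


def solve_gram(A, R, b, tol=1e-10, max_iter=200):
    """Solve (A^T A) x = b by conjugate gradient preconditioned with R^T R."""

    def apply_gram(v):
        # Never form A^T A explicitly; two matrix-vector products suffice.
        return A.T @ (A @ v)

    def apply_prec(v):
        # Computes (R^T R)^{-1} v with two d x d solves.
        return np.linalg.solve(R, np.linalg.solve(R.T, v))

    x = np.zeros_like(b, dtype=float)
    r = np.array(b, dtype=float)  # residual b - (A^T A) x, with x = 0
    z = apply_prec(r)
    p = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Gp = apply_gram(p)
        alpha = rz / (p @ Gp)
        x += alpha * p
        r -= alpha * Gp
        if np.linalg.norm(r) <= tol * b_norm:
            break
        z = apply_prec(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x


def solve_gram_power(A, b, j, **kwargs):
    """Approximate the minimizer of ||(A^T A)^j x - b||_2 via j successive solves."""
    R = sketch_preconditioner(A)  # built once, reused for every solve
    x = np.array(b, dtype=float)
    for _ in range(j):
        x = solve_gram(A, R, x, **kwargs)
    return x
```

The preconditioner is computed once from the small sketched matrix and reused across all $j$ solves, so the per-solve cost is dominated by matrix-vector products with $A$ and $A^\top$. A call such as `solve_gram_power(A, b, j=2)` with a tall `A` and a length-`d` vector `b` returns an approximate solution of $(A^\top A)^2 x = b$.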

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Graphene research and applications