Multi-matrix Factorization Attention

Jingcheng Hu; Houyi Li; Yinmin Zhang; Zili Wang; Shuigeng Zhou; Xiangyu Zhang; Heung-Yeung Shum; Daxin Jiang

arXiv:2412.19255·cs.LG·January 15, 2025



TL;DR

This paper introduces Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR), novel attention architectures that improve model capacity and reduce memory usage under strict Key-Value cache constraints, outperforming existing methods.

Contribution

The paper presents MFA and MFA-KR, new attention mechanisms that enhance capacity and efficiency under cache constraints, a significant advancement over prior multi-head attention variants.

Findings

01. MFA outperforms MLA under tight KV cache conditions.

02. MFA-KR reduces memory usage by up to 93.7%.

03. Both methods maintain competitive performance with standard MHA.
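The cache savings above come down to simple arithmetic on what is stored per token. The following is a minimal sketch of that arithmetic, not the paper's accounting: the function name `kv_cache_elems_per_token`, the layer count, head counts, and head dimensions are all hypothetical, and the "key only" case is just one reading of what key reuse implies for storage. The resulting percentages are illustrative and are not the figures reported in the paper.

```python
def kv_cache_elems_per_token(n_layers, n_kv_heads, head_dim, cache_values=True):
    """Count cached elements per token across all layers.

    cache_values=False models a cache that stores keys only (assumption:
    values are recomputed from the cached keys rather than stored).
    """
    tensors = 2 if cache_values else 1  # keys + values, or keys alone
    return n_layers * n_kv_heads * head_dim * tensors


def reduction(baseline, variant):
    """Fraction of the baseline cache saved by the variant."""
    return 1.0 - variant / baseline


# Hypothetical configurations, chosen only to make the arithmetic easy to follow.
mha = kv_cache_elems_per_token(n_layers=24, n_kv_heads=8, head_dim=64)
shared_head = kv_cache_elems_per_token(n_layers=24, n_kv_heads=1, head_dim=128)
key_only = kv_cache_elems_per_token(
    n_layers=24, n_kv_heads=1, head_dim=128, cache_values=False
)

print(f"shared head saves {reduction(mha, shared_head):.1%}")
print(f"key-only cache saves {reduction(mha, key_only):.1%}")
```

Under these assumed settings, caching a single wider head already removes most of the cache, and dropping the separate value tensor halves what remains.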

Abstract

We propose novel attention architectures, Multi-matrix Factorization Attention (MFA) and MFA-Key-Reuse (MFA-KR). Existing variants of standard Multi-Head Attention (MHA), including SOTA methods like MLA, fail to maintain comparably strong performance under stringent Key-Value cache (KV cache) constraints. MFA enhances model capacity by efficiently scaling up both the number and dimension of attention heads through low-rank matrix factorization in the Query-Key (QK) circuit. Extending MFA, MFA-KR further reduces memory requirements by repurposing the key cache as value through value projection re-parameterization. MFA's design enables strong model capacity when working under a tight KV cache budget, while MFA-KR is suitable for even harsher KV cache limits with a minor performance trade-off. Notably, in our extensive and large-scale experiments, the proposed architecture outperforms MLA and…
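As a rough illustration of low-rank factorization on the query side of the QK circuit, the sketch below replaces one large query projection with a product of two thin matrices and computes causal attention weights against a single shared key head. This is a sketch under stated assumptions, not the authors' implementation: the shapes, the `rank` value, the names `w_down`, `w_up`, and `w_k`, and the choice of one shared key head are all hypothetical and are not taken from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, for illustration only.
d_model, n_heads, head_dim, rank, seq_len = 512, 8, 64, 64, 10

x = rng.standard_normal((seq_len, d_model))

# Query projection factorized as (d_model x rank) @ (rank x n_heads*head_dim)
# instead of one dense d_model x (n_heads*head_dim) matrix.
w_down = rng.standard_normal((d_model, rank)) / np.sqrt(d_model)
w_up = rng.standard_normal((rank, n_heads * head_dim)) / np.sqrt(rank)

# Assumption: a single key head shared by all query heads, so only
# seq_len x head_dim key entries would need to be cached per layer.
w_k = rng.standard_normal((d_model, head_dim)) / np.sqrt(d_model)

q = (x @ w_down @ w_up).reshape(seq_len, n_heads, head_dim)
k = x @ w_k  # shape (seq_len, head_dim)

# Scaled dot-product scores per head: (n_heads, query_pos, key_pos).
scores = np.einsum("qhd,kd->hqk", q, k) / np.sqrt(head_dim)

# Causal mask: each position attends only to itself and earlier positions.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Numerically stable softmax over the key axis.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)

# Parameter count of the dense projection versus the factorized one.
full_params = d_model * n_heads * head_dim
factored_params = d_model * rank + rank * n_heads * head_dim
print(weights.shape, full_params, factored_params)
```

The point of the factorization in this sketch is that the number of query heads and their dimension can grow through `w_up` while the parameter cost is governed by `rank`, and the cached key tensor does not grow with the number of query heads.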


Taxonomy

Topics: Cognitive Science and Mapping

Methods: Linear Layer · Softmax · Attention Is All You Need · Multi-Head Attention