TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs

Yuxuan Gu; Wuyang Zhou; Giorgos Iacovides; Danilo Mandic

arXiv:2501.15674·cs.CL·May 16, 2025

TL;DR

TensorLLM introduces a novel tensorisation and Tucker decomposition method to compress and denoise Multi-head Attention in LLMs, significantly enhancing reasoning abilities without extra training.

Contribution

The paper presents a new tensorisation framework for MHA weights that enables high-dimensional denoising and compression, improving LLM reasoning performance.

Findings

1. Achieves up to 250x compression of MHA weights.

2. Enhances reasoning capabilities across multiple benchmarks.

3. Can be combined with existing denoising techniques for further gains.

Abstract

The reasoning abilities of Large Language Models (LLMs) can be improved by structurally denoising their weights, yet existing techniques primarily focus on denoising the feed-forward network (FFN) of the transformer block and cannot efficiently utilise the Multi-head Attention (MHA) block, which is the core of transformer architectures. To address this issue, we propose a novel, intuitive framework that, at its very core, performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. This enables both higher-dimensional structured denoising and compression of the MHA weights by enforcing a shared higher-dimensional subspace across the weights of the multiple attention heads. We demonstrate that this approach consistently enhances the reasoning capabilities of LLMs across multiple benchmark datasets, and for both encoder-only and decoder-only…
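
The tensorisation and Tucker steps described in the abstract can be illustrated with a small NumPy sketch. It is one plausible reading, not the authors' implementation: the 4D layout (model dimension, head dimension, projection type, head index), the toy sizes, the ranks, and all function names are assumptions made for illustration, and a truncated higher-order SVD stands in for whatever Tucker solver the paper uses. The head mode is left uncompressed, so every head keeps its own small core while sharing the same three factor matrices, which is the "shared subspace" idea in miniature.

```python
import numpy as np


def unfold(tensor, mode):
    """Mode-n unfolding: bring `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)


def mode_product(tensor, matrix, mode):
    """Mode-n product: multiply `tensor` by `matrix` along axis `mode`."""
    moved = np.moveaxis(tensor, mode, 0)
    out = np.tensordot(matrix, moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)


def tucker_shared_subspace(weights, ranks):
    """Truncated HOSVD over the first len(ranks) modes.

    Returns a core tensor and one orthonormal factor matrix per
    compressed mode. Trailing modes (here: the head index) are untouched.
    """
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(weights, mode), full_matrices=False)
        factors.append(u[:, :rank])
    core = weights
    for mode, u in enumerate(factors):
        core = mode_product(core, u.T, mode)
    return core, factors


def reconstruct(core, factors):
    """Map the core back to the original weight space."""
    out = core
    for mode, u in enumerate(factors):
        out = mode_product(out, u, mode)
    return out


# Toy sizes (assumed, far smaller than any real LLM).
d_model, d_head, n_heads = 32, 8, 4
rng = np.random.default_rng(0)

# weights[:, :, j, i] holds the j-th projection (Q, K, V, O) of head i.
# Random values stand in for trained weights.
weights = rng.standard_normal((d_model, d_head, 4, n_heads))

ranks = (6, 4, 2)  # hypothetical multilinear ranks for the first three modes
core, factors = tucker_shared_subspace(weights, ranks)
approx = reconstruct(core, factors)

n_original = weights.size
n_compressed = core.size + sum(u.size for u in factors)
print("compression ratio:", n_original / n_compressed)
print("relative error:", np.linalg.norm(weights - approx) / np.linalg.norm(weights))
```

On random weights the reconstruction error is large, since there is no low-rank structure to exploit; the sketch only shows the mechanics. Replacing the original weights with `approx` is the denoising view, while storing `core` and `factors` instead is the compression view.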

Code & Models

Repositories

guyuxuan9/tensorllm (PyTorch, official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques

Methods: Softmax · Linear Layer · Attention Is All You Need · Multi-Head Attention · TuckER · Focus