Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning

Magauiya Zhussip; Dmitriy Shopkhoev; Ammar Ali; Stamatios Lefkimmiatis

arXiv:2508.04581·cs.CL·February 23, 2026

TL;DR

This paper introduces MASA (Matrix Atom Sharing in Attention), a method for sharing attention weights across transformer layers via matrix-based dictionary learning. It reduces the attention module's parameters by 66.7% while keeping performance on par with the unshared model.

Contribution

MASA is a framework that decomposes the attention projection matrices (Q, K, V, O) into dictionary atoms shared across layers, reducing parameters without distillation or architectural changes.

Findings

1. Reduces attention module parameters by 66.7% while maintaining performance

2. Outperforms baselines such as GQA and low-rank methods at similar parameter budgets

3. Extends effectively to Vision Transformers with fewer parameters and no performance loss
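The 66.7% figure in the first finding can be sanity-checked with simple parameter counting. The sketch below is not taken from the paper: it assumes square `d_model x d_model` projections, one scalar coefficient per (layer, atom) pair, and a dictionary whose size is one third of the layer count. The function name and the concrete sizes are hypothetical.

```python
def attention_param_reduction(num_layers: int, d_model: int, num_atoms: int) -> float:
    """Fraction of parameters saved for one projection type (e.g. Q) across all layers.

    Assumption: every layer's projection is a weighted sum of `num_atoms` shared
    d_model x d_model matrices, with one scalar weight per (layer, atom) pair.
    """
    # Unshared baseline: each layer stores its own full projection matrix.
    baseline = num_layers * d_model * d_model
    # Shared variant: the dictionary of atoms plus the per-layer mixing coefficients.
    shared = num_atoms * d_model * d_model + num_layers * num_atoms
    return 1.0 - shared / baseline


# Hypothetical configuration: 24 layers sharing 8 atoms (one third of the depth).
reduction = attention_param_reduction(num_layers=24, d_model=1024, num_atoms=8)
print(f"{reduction:.1%}")  # prints 66.7%
```

Under these assumptions the coefficient overhead is negligible, so the saving is governed almost entirely by the ratio of atoms to layers.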

Abstract

Large language models have revolutionized AI applications, yet their high computational and memory demands hinder their widespread deployment. Existing compression techniques focus on intra-block optimizations (e.g., low-rank approximation or attention pruning), while the repetitive layered structure of transformers implies significant inter-block redundancy, a dimension largely unexplored beyond key-value (KV) caching. Inspired by dictionary learning in convolutional networks, we propose a framework for structured weight sharing across transformer layers. Our approach decomposes attention projection matrices (Q, K, V, O) into shared dictionary atoms, reducing the attention module's parameters by 66.7% while achieving on-par performance. Unlike complex methods requiring distillation or architectural changes, MASA (Matrix Atom Sharing in Attention) operates as a drop-in…
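One way to picture the decomposition described in the abstract is to treat each layer's projection matrix as a linear combination of a small set of matrix atoms shared by all layers. The NumPy sketch below illustrates that idea only. It is not the authors' implementation, the atoms and coefficients are random rather than learned, and all names and sizes (`atoms`, `coeffs`, `layer_weight`, 6 layers, 2 atoms) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
num_layers, num_atoms, d_model = 6, 2, 16  # hypothetical toy sizes

# Shared dictionary: `num_atoms` matrices used by every layer.
atoms = rng.standard_normal((num_atoms, d_model, d_model))
# Layer-specific mixing coefficients: one scalar per (layer, atom) pair.
coeffs = rng.standard_normal((num_layers, num_atoms))


def layer_weight(layer: int) -> np.ndarray:
    """Rebuild one layer's projection (e.g. W_Q) as a weighted sum of the shared atoms."""
    return np.einsum("s,sij->ij", coeffs[layer], atoms)


# Stack the reconstructed projections for all layers: (num_layers, d_model, d_model).
W_q = np.stack([layer_weight(layer) for layer in range(num_layers)])
print(W_q.shape)  # (6, 16, 16)
```

In an actual model the atoms and coefficients would be trained or fitted to pretrained weights, and a separate dictionary could be kept for each of Q, K, V, and O. Whether MASA does exactly this is not stated in the truncated abstract above.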
