Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers

Shubham Aggarwal; Lokendra Kumar

arXiv:2603.08343 · cs.LG · March 31, 2026

TL;DR

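This paper replaces the dense output projection in multi-head attention with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. The change removes roughly 25% of attention parameters per block, and the authors report a steeper validation-loss curve per training FLOP than dense baselines.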

Contribution

The paper proposes a parameter-free structured transform, a fixed WHT followed by a diagonal affine transformation, for the attention output projection, aimed at better compute efficiency and scalability in transformer models.

Findings

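01. Replacing the dense output projection with a WHT removes approximately 25% of attention parameters per block.

02. WHT-augmented models show a steeper validation-loss curve relative to training FLOPs than dense baselines, which suggests better compute utilization during training.

03. Efficiency gains (reduced memory footprint, increased throughput) grow monotonically with model size, batch size, and sequence length.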
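
The roughly 25% figure can be checked with a back-of-the-envelope count. The sketch below is not the paper's own accounting: it assumes the usual four square d_model x d_model attention matrices (query, key, value, output), ignores biases, and assumes the diagonal affine step costs one scale and one shift per channel. The function name attention_param_counts is hypothetical.

```python
def attention_param_counts(d_model):
    """Return (dense, wht, fractional_reduction) for one attention block.

    Assumptions (not taken from the paper): Q, K, V and output projections
    are each d_model x d_model with no biases, and the diagonal affine step
    adds 2 * d_model parameters (per-channel scale and shift).
    """
    dense = 4 * d_model * d_model                 # Q, K, V, and dense output projection
    wht = 3 * d_model * d_model + 2 * d_model     # Q, K, V plus diagonal affine
    return dense, wht, (dense - wht) / dense


dense, wht, reduction = attention_param_counts(1024)
print(f"dense={dense}, wht={wht}, reduction={reduction:.4f}")
```

Under these assumptions the reduction is 1/4 - 1/(2 * d_model), so it approaches 25% as the model dimension grows.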

Abstract

The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation loss curve relative to training FLOPs than dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate…
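
To make the mechanism concrete, here is a minimal pure-Python sketch of an orthonormal fast Walsh-Hadamard transform applied to one token's concatenated head outputs, followed by a per-channel scale and shift. It is an illustration under assumptions, not the authors' implementation: the names fwht, wht_output_projection, gamma, and beta are hypothetical, the model dimension is assumed to be a power of two, and the 1/sqrt(n) normalization is chosen so the transform is norm-preserving.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine entries that are h apart within blocks of 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthogonal
    return [v * scale for v in y]


def wht_output_projection(concat_heads, gamma, beta):
    """Mix all head channels with a WHT, then apply a diagonal affine map.

    gamma and beta are hypothetical per-channel scale and shift vectors.
    """
    mixed = fwht(concat_heads)
    return [g * m + b for g, m, b in zip(gamma, mixed, beta)]


# Two heads of width 4, concatenated into one vector of length 8.
heads = [3.0, -1.0, 2.0, 0.5, 0.0, 4.0, -2.5, 1.0]
out = wht_output_projection(heads, gamma=[1.0] * 8, beta=[0.0] * 8)
print(out)
```

Every output channel depends on every input channel, which is how the transform keeps cross-head interaction without a dense weight matrix. The butterfly structure costs O(n log n) additions per token rather than the O(n^2) multiply-adds of a dense projection.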
