Rethinking Attention Output Projection: Structured Hadamard Transforms for Efficient Transformers
Shubham Aggarwal, Lokendra Kumar

TL;DR
This paper replaces the dense output projection in multi-head attention with a fixed Walsh-Hadamard Transform, reducing parameters and improving efficiency while maintaining performance.
Contribution
It proposes a novel, parameter-free structured transform for attention output projection that enhances compute efficiency and scalability in transformer models.
Findings
WHT reduces attention parameters by ~25% per block.
Models with WHT show a steeper validation loss curve relative to training FLOPs than dense baselines, suggesting better compute utilization during training.
Efficiency gains grow with model size, batch size, and sequence length.
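The ~25% figure is consistent with a simple parameter count, assuming a standard attention block with four bias-free d_model x d_model projections (query, key, value, output). That layout is an assumption made here for illustration, and the function names are hypothetical. A minimal sketch of the arithmetic:

```python
def attention_params_dense(d_model: int) -> int:
    # Assumed layout: Q, K, V, and output projections,
    # each d_model x d_model, with no bias terms.
    return 4 * d_model * d_model


def attention_params_wht(d_model: int) -> int:
    # Q, K, V stay dense. The output projection becomes a fixed WHT
    # (no parameters) plus a diagonal affine map, i.e. one scale and
    # one shift per channel.
    return 3 * d_model * d_model + 2 * d_model


def reduction(d_model: int) -> float:
    # Fraction of attention parameters removed per block.
    dense = attention_params_dense(d_model)
    return (dense - attention_params_wht(d_model)) / dense
```

Under these assumptions the saving is 1/4 minus 1/(2 * d_model), so it approaches 25% as the model dimension grows.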
Abstract
The dense output projection in multi-head attention scales quadratically with model dimension, contributing significantly to parameter count, memory footprint, and inference cost. We propose replacing this projection with a fixed, parameter-free Walsh-Hadamard Transform (WHT) followed by a diagonal affine transformation. This approach eliminates approximately 25 percent of attention parameters per block while maintaining global cross-head interaction through an orthogonal, norm-preserving transformation. Our results demonstrate that WHT-augmented models exhibit a steeper validation loss curve relative to training FLOPs compared to dense baselines, suggesting superior compute utilization during training. Crucially, we show that efficiency gains, including reduced memory footprint and increased throughput, grow monotonically with model size, batch size, and sequence length. We evaluate…
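To make the idea of a fixed WHT followed by a diagonal affine transformation concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the names `fwht`, `wht_output_projection`, `gamma`, and `beta` are hypothetical, the input is a single token's concatenated head outputs, and the model dimension is assumed to be a power of two.

```python
import math
from typing import List


def fwht(x: List[float]) -> List[float]:
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Scaling by 1/sqrt(n) makes the transform orthogonal (norm-preserving).
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def wht_output_projection(
    heads_concat: List[float], gamma: List[float], beta: List[float]
) -> List[float]:
    # Fixed, parameter-free mixing across all channels (and hence across heads),
    # followed by a learned per-channel scale (gamma) and shift (beta).
    mixed = fwht(heads_concat)
    return [g * m + b for g, m, b in zip(gamma, mixed, beta)]
```

The butterfly loop costs O(d log d) operations per token instead of the O(d^2) of a dense matrix multiply, and the only learned quantities are the 2d entries of `gamma` and `beta`. Because the normalized Hadamard matrix is symmetric and orthogonal, applying `fwht` twice returns the original vector.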
