Hierarchical Shift Mixing -- Beyond Dense Attention in Transformers
Robert Forchheimer

TL;DR
This paper introduces Hierarchical Shift Mixing (HSM), a token-mixing framework for Transformers that achieves linear-time complexity while keeping performance close to softmax attention and enabling efficient hybrid architectures.
Contribution
HSM is a flexible token-mixing method that distributes pairwise token interactions across layers, achieving near-softmax performance with linear complexity. Hybrid HSM-softmax models can outperform a GPT-style Transformer baseline.
Findings
HSM variants perform close to softmax attention in accuracy.
Hybrid HSM-softmax models outperform baseline Transformers.
HSM reduces training and inference computational costs.
Abstract
Since the introduction of the Transformer architecture for large language models, the softmax-based attention layer has faced increasing scrutiny due to its quadratic-time computational complexity. Attempts have been made to replace it with less complex methods, at the cost of reduced performance in most cases. We introduce Hierarchical Shift Mixing (HSM), a general framework for token mixing that distributes pairwise token interactions across Transformer layers rather than computing them densely within each layer. HSM enables linear-time complexity while remaining agnostic to the specific mixing function. We show that even simple HSM variants achieve performance close to softmax attention, and that hybrid architectures combining HSM with softmax attention can outperform a GPT-style Transformer baseline while reducing computational cost during both training and inference.
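The idea of spreading pairwise interactions across layers can be illustrated with a small sketch. The abstract does not specify the shift schedule or the mixing function, so everything below is an assumption made for illustration: each layer pairs a token with the one `2 ** layer` positions earlier, and the mix is a plain average. The names `shift_mix_layer` and `receptive_fields` are hypothetical and do not come from the paper. Under these assumptions, each layer performs at most one pairwise interaction per token, yet after about log2(N) layers every position can be influenced by all earlier positions.

```python
import math

def shift_mix_layer(x, shift, mix=lambda a, b: 0.5 * (a + b)):
    """Mix each token with the token `shift` positions earlier (causal).

    Tokens without a partner pass through unchanged. The averaging `mix`
    is a placeholder; the abstract says HSM is agnostic to this function.
    """
    return [mix(x[i], x[i - shift]) if i >= shift else x[i]
            for i in range(len(x))]

def receptive_fields(n_tokens, n_layers):
    """Track which input positions can influence each output position."""
    fields = [{i} for i in range(n_tokens)]
    for layer in range(n_layers):
        shift = 2 ** layer  # ASSUMED schedule, not taken from the paper
        fields = [fields[i] | fields[i - shift] if i >= shift else fields[i]
                  for i in range(n_tokens)]
    return fields

n = 8
layers = math.ceil(math.log2(n))
fields = receptive_fields(n, layers)

# Every position sees all earlier positions after log2(n) layers.
full_causal = all(fields[i] == set(range(i + 1)) for i in range(n))

# Pairwise interactions: at most one per token per layer, versus a full
# causal triangle per layer for dense attention.
pairwise_ops = sum(max(n - 2 ** l, 0) for l in range(layers))
dense_ops = layers * n * (n + 1) // 2

print(full_causal, pairwise_ops, dense_ops)
```

The comparison of `pairwise_ops` with `dense_ops` only counts token pairs under the assumed schedule; it says nothing about the constants or the accuracy reported in the paper.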
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Generative Adversarial Networks and Image Synthesis
