Paramixer: Parameterizing Mixing Links in Sparse Factors Works Better than Dot-Product Self-Attention
Tong Yu, Ruslan Khalitov, Lei Cheng, Zhirong Yang

TL;DR
Paramixer is a scalable mixing block that replaces dot-product self-attention with a product of sparse, full-rank factors, cutting the computing cost from O(N^2) to O(N log N) in the sequence length N and avoiding the low-rank bottleneck.
Contribution
The paper proposes Paramixer, a novel sparse matrix factorization approach with MLP-parameterized entries, improving efficiency and effectiveness over standard self-attention methods.
Findings
Paramixer outperforms dot-product self-attention on synthetic and real-world datasets.
It reduces the computing cost from O(N^2) to O(N log N) in the sequence length N.
All factorizing matrices are full-rank, avoiding low-rank bottlenecks.
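To give a feel for the gap between O(N^2) and O(N log N), the short Python sketch below counts the entries a full attention matrix has to score against the non-zeros in a stack of sparse factors. The number of factors (ceil(log2 N)) and the `nnz_per_row` parameter are illustrative assumptions; the summary above states only the asymptotic cost.

```python
import math

def dense_pairs(n):
    """Entries scored by a full N x N attention matrix."""
    return n * n

def sparse_nonzeros(n, nnz_per_row=2):
    """Non-zeros across ceil(log2 N) sparse factors.

    Both the factor count and nnz_per_row are assumptions made for
    illustration, not values taken from the paper.
    """
    return n * nnz_per_row * math.ceil(math.log2(n))

for n in (1024, 16384):
    print(n, dense_pairs(n), sparse_nonzeros(n))
```

Under these assumptions, a sequence of length 1024 needs 1,048,576 dense entries but only 20,480 sparse non-zeros.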
Abstract
Self-Attention is a widely used building block in neural modeling to mix long-range data elements. Most self-attention neural networks employ pairwise dot-products to specify the attention coefficients. However, these methods require O(N^2) computing cost for sequence length N. Even though some approximation methods have been introduced to relieve the quadratic cost, the performance of the dot-product approach is still bottlenecked by the low-rank constraint in the attention matrix factorization. In this paper, we propose a novel scalable and effective mixing building block called Paramixer. Our method factorizes the interaction matrix into several sparse matrices, where we parameterize the non-zero entries by MLPs with the data elements as input. The overall computing cost of the new building block is as low as O(N log N). Moreover, all factorizing matrices in Paramixer are…
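The factorization idea in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the sparsity pattern (each row of factor k touches position i and position (i + 2^k) mod N), the two-layer ReLU MLP, and the helper names `mlp`, `init_params`, `mix`, and `dense_factor` are all hypothetical choices. The abstract says only that the non-zero entries of several sparse factors are produced by MLPs from the data elements.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer ReLU MLP applied row-wise: (N, d) -> (N, 2)."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

def init_params(n, d, hidden, rng):
    """One small MLP per sparse factor; ceil(log2 N) factors (assumed)."""
    n_factors = int(np.ceil(np.log2(n)))
    return [
        (0.1 * rng.normal(size=(d, hidden)), np.zeros(hidden),
         0.1 * rng.normal(size=(hidden, 2)), np.ones(2))
        for _ in range(n_factors)
    ]

def factor_weights(x, params):
    """Non-zero entries of every factor, computed from the data elements."""
    return [mlp(x, *p) for p in params]

def mix(x, params):
    """Apply the product of sparse factors without building any N x N matrix.

    Assumed pattern: row i of factor k has non-zeros at columns i and
    (i + 2**k) % N, so each factor costs O(N) and the stack O(N log N).
    """
    v = x
    for k, w in enumerate(factor_weights(x, params)):
        shifted = np.roll(v, -(2 ** k), axis=0)  # row i holds v[(i + 2**k) % N]
        v = w[:, :1] * v + w[:, 1:] * shifted
    return v

def dense_factor(w, k):
    """Dense N x N view of factor k, for checking only."""
    n = w.shape[0]
    m = np.zeros((n, n))
    rows = np.arange(n)
    m[rows, rows] += w[:, 0]
    m[rows, (rows + 2 ** k) % n] += w[:, 1]
    return m

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))           # N = 8 elements, d = 4 features
params = init_params(8, 4, 16, rng)   # 3 factors for N = 8
y = mix(x, params)                    # mixed sequence, shape (8, 4)
```

Multiplying the dense views from `dense_factor` together and applying the result to `x` gives the same output as `mix`, which is a handy way to check that the sparse routine really computes a product of factors.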
Taxonomy
Topics: Machine Learning and ELM · Neural Networks and Applications · Advanced Neural Network Applications
