Rethinking Dense Linear Transformations: Stagewise Pairwise Mixing (SPM) for Near-Linear Training in Neural Networks
Peter Farag

TL;DR
This paper introduces Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces dense layers with efficient, compositional, sparse stages, reducing computational cost and improving generalization in neural networks.
Contribution
The paper proposes SPM, a structured linear layer with near-linear training complexity that can replace dense layers. It derives explicit forward and backward formulas for the layer and reports improved generalization.
Findings
SPM reduces the computational cost of a linear layer from quadratic to near-linear in the layer width.
SPM improves accuracy on structured learning tasks.
SPM retains competitive performance on benchmarks.
Abstract
Dense linear layers are a dominant source of computational and parametric cost in modern machine learning models, even though their complexity is quadratic and they are often misaligned with the compositional structure of learned representations. We introduce Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces dense matrices with a composition of sparse pairwise-mixing stages. An SPM layer implements a global linear transformation in $O(nL)$ time with $O(nL)$ parameters, where $L$ is typically constant or $\log_2 n$, and admits exact closed-form forward and backward computations. SPM is designed as a drop-in replacement for dense linear layers in feedforward networks, recurrent architectures, attention mechanisms, and similar models. We derive complete forward and backward expressions for two parameterizations: an orthogonal norm-preserving rotation-based variant and a fully general $2…
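The composition of pairwise-mixing stages described in the abstract can be illustrated with a minimal sketch of the rotation-based variant. This is not the paper's implementation. It assumes a butterfly pairing rule (at stage s, index i is paired with index i + 2^s), which the abstract does not specify, and it assumes the width n is a power of two with at most log2(n) stages. The names spm_rotation_forward and param_counts are hypothetical.

```python
import math


def spm_rotation_forward(x, thetas):
    """Apply len(thetas) pairwise rotation stages to the vector x.

    x      : list of n floats, n a power of two (assumption).
    thetas : list of stages; thetas[s] holds n // 2 rotation angles.
    Pairing rule (assumption, not from the source): at stage s,
    index i is paired with index i + 2**s (butterfly pattern).
    """
    n = len(x)
    assert n >= 2 and n & (n - 1) == 0, "n must be a power of two"
    assert (1 << len(thetas)) <= n, "this sketch allows at most log2(n) stages"
    y = list(x)
    for s, angles in enumerate(thetas):
        stride = 1 << s
        out = list(y)
        k = 0  # index of the current pair within this stage
        for i in range(n):
            if i & stride:
                continue  # i is the upper element of a pair already handled
            j = i | stride
            c, sn = math.cos(angles[k]), math.sin(angles[k])
            # 2x2 rotation applied to the pair (y[i], y[j])
            out[i] = c * y[i] - sn * y[j]
            out[j] = sn * y[i] + c * y[j]
            k += 1
        y = out
    return y


def param_counts(n, num_stages):
    """Parameter counts: dense n-by-n matrix vs. one angle per pair per stage."""
    return {"dense": n * n, "spm_rotation": num_stages * (n // 2)}


if __name__ == "__main__":
    n, num_stages = 8, 3
    x = [float(i + 1) for i in range(n)]
    thetas = [[0.1 * (s + 1) + 0.05 * k for k in range(n // 2)]
              for s in range(num_stages)]
    y = spm_rotation_forward(x, thetas)
    norm = lambda v: math.sqrt(sum(t * t for t in v))
    print(norm(x), norm(y))  # the two norms agree up to rounding
    print(param_counts(n, num_stages))
```

Each stage touches every coordinate once, so the cost per stage is linear in n, and because every stage is a product of disjoint 2x2 rotations, the composed map preserves the Euclidean norm of the input. With log2(n) stages under this pairing, every output coordinate can depend on every input coordinate.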
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Generative Adversarial Networks and Image Synthesis
