Rethinking Dense Linear Transformations: Stagewise Pairwise Mixing (SPM) for Near-Linear Training in Neural Networks

Peter Farag

arXiv:2512.23905·cs.LG·January 1, 2026

TL;DR

This paper introduces Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces a dense matrix with a composition of sparse pairwise-mixing stages, reducing the cost of a layer from quadratic to $O(nL)$ and improving generalization on structured learning tasks.

Contribution

The paper proposes SPM, a novel structured linear layer that achieves near-linear training complexity and can replace dense layers, with explicit forward/backward formulas and improved generalization.

Findings

1. SPM reduces computational cost significantly.

2. SPM improves accuracy on structured learning tasks.

3. SPM retains competitive performance on benchmarks.

Abstract

Dense linear layers are a dominant source of computational and parametric cost in modern machine learning models, despite their quadratic complexity and often being misaligned with the compositional structure of learned representations. We introduce Stagewise Pairwise Mixers (SPM), a structured linear operator that replaces dense matrices with a composition of sparse pairwise-mixing stages. An SPM layer implements a global linear transformation in $O(nL)$ time with $O(nL)$ parameters, where $L$ is typically constant or $\log_2 n$, and admits exact closed-form forward and backward computations. SPM is designed as a drop-in replacement for dense linear layers in feedforward networks, recurrent architectures, attention mechanisms, etc. We derive complete forward and backward expressions for two parameterizations: an orthogonal norm-preserving rotation-based variant and a fully general $2…
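
To make the stage structure concrete, here is a minimal NumPy sketch of the rotation-based variant. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify how coordinates are paired, so the butterfly-style pairing (index i with i + 2^l at stage l), the power-of-two width, and the names `spm_forward` and `thetas` are all hypothetical choices.

```python
import numpy as np


def spm_forward(x, thetas):
    """Apply L stages of disjoint 2x2 rotations to a length-n vector.

    ASSUMPTIONS (not taken from the paper): n is a power of two, and at
    stage l index i is paired with i + stride, where stride = 2**(l mod log2 n).
    thetas has shape (L, n // 2): one rotation angle per pair per stage.
    """
    n = x.shape[0]
    assert n >= 2 and n & (n - 1) == 0, "this sketch assumes n is a power of two"
    levels = n.bit_length() - 1          # log2(n)
    idx = np.arange(n)
    y = x.astype(float)                  # astype returns a copy, so x is untouched

    for l, theta in enumerate(thetas):
        stride = 1 << (l % levels)
        lo = idx[(idx & stride) == 0]    # first member of each pair
        hi = lo + stride                 # second member of each pair
        c, s = np.cos(theta), np.sin(theta)
        a, b = y[lo], y[hi]              # fancy indexing copies, so updates do not alias
        y[lo] = c * a - s * b            # rotate each pair (a, b) by its own angle
        y[hi] = s * a + c * b
    return y


# Example: n = 8 with L = log2(n) = 3 stages.
rng = np.random.default_rng(0)
n, L = 8, 3
thetas = rng.uniform(-np.pi, np.pi, size=(L, n // 2))
x = rng.normal(size=n)
y = spm_forward(x, thetas)
```

In this sketch each stage touches every coordinate once, so L stages cost O(nL) operations and store nL/2 angles, compared with n^2 weights for a dense matrix. Because every stage is a set of disjoint plane rotations, the composition is orthogonal and the Euclidean norm of `x` is preserved.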

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Generative Adversarial Networks and Image Synthesis