TL;DR
This paper proposes a novel approach to global sequence modeling that replaces explicit attention with dynamically predicted parameters, achieving Transformer-level performance with linear complexity.
Contribution
It introduces a dynamic parameterization method that models global context without explicit attention, enabling efficient linear-time sequence modeling.
Findings
Dynamic parameterization can replace explicit attention in vision models.
The proposed method achieves performance comparable to Transformers.
Code is available at https://github.com/LeapLabTHU/WeightFormer.
Abstract
Existing research largely attributes the global sequence modeling capability of Transformers to the explicit computation of attention weights, a process that inherently incurs quadratic computational complexity. In this work, we offer a novel perspective: we demonstrate that attention can be mathematically reframed as a Multi-Layer Perceptron (MLP) equipped with dynamically predicted parameters. Through this lens, we explain attention's global modeling power not as explicit token-wise aggregation, but as an implicit process where dynamically generated parameters act as a compressed representation of the global context. Inspired by this insight, we investigate a fundamental question: can we achieve Transformer-level global sequence modeling entirely through dynamic parameterization while maintaining linear complexity, effectively replacing explicit attention? To explore this, we design…
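The reframing of attention as an MLP with dynamically predicted parameters can be illustrated with a small numerical check. The sketch below is not taken from the paper or its repository. It uses one standard way of reading single-head softmax attention as a two-layer MLP, and the names `attention`, `dynamic_mlp`, `W1`, and `W2` are hypothetical, chosen only for this illustration. The first-layer weights are the scaled, transposed keys, the activation is a softmax, and the second-layer weights are the values, so both weight matrices are computed from the input sequence rather than learned as fixed parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard single-head scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def dynamic_mlp(Q, K, V):
    # The same computation, read as a two-layer MLP applied to each query.
    # Assumption for illustration: both weight matrices are "predicted"
    # from the input sequence through K and V.
    d = Q.shape[-1]
    W1 = K.T / np.sqrt(d)     # first-layer weights, shape (d, n)
    W2 = V                    # second-layer weights, shape (n, d_v)
    hidden = softmax(Q @ W1)  # hidden activations, width n (sequence length)
    return hidden @ W2

rng = np.random.default_rng(0)
n, d = 6, 4                   # toy sequence length and feature size
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

out_attn = attention(Q, K, V)
out_mlp = dynamic_mlp(Q, K, V)
print(np.allclose(out_attn, out_mlp))
```

In this reading the hidden layer has one unit per token, so the cost still grows quadratically with sequence length. The question the abstract raises is whether dynamically predicted parameters of a size that does not grow with the sequence can give comparable global modeling at linear cost.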
