PoM: Efficient Image and Video Generation with the Polynomial Mixer
David Picard, Nicolas Dufour

TL;DR
This paper introduces the Polynomial Mixer, a linear-complexity alternative to Multi-Head Attention for diffusion models, enabling efficient high-quality image and video generation with reduced memory and compute requirements.
Contribution
The Polynomial Mixer replaces MHA in diffusion transformers, offering linear complexity and sequential frame generation, improving efficiency without sacrificing quality.
Findings
PoM-based models generate high-quality image and video samples with fewer computational resources.
Polynomial Mixer achieves linear complexity, reducing memory and compute costs.
Diffusion transformers adapted to use PoM in place of MHA maintain their generation quality.
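To make the scaling claim concrete, the following back-of-the-envelope sketch counts token interactions only. The function names are hypothetical, constant factors and feature dimensions are ignored, and the numbers are not measurements from the paper.

```python
def pairwise_interactions(n_tokens: int) -> int:
    """Attention-style mixing: every token is compared with every token."""
    return n_tokens * n_tokens


def state_updates(n_tokens: int) -> int:
    """State-based mixing: each token contributes once to a shared state."""
    return n_tokens


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(n, pairwise_interactions(n), state_updates(n))
```

Doubling the number of tokens quadruples the pairwise count but only doubles the number of state updates, which is the gap that grows with image resolution or video length.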
Abstract
Diffusion models based on Multi-Head Attention (MHA) have become ubiquitous for generating high-quality images and videos. However, encoding an image or a video as a sequence of patches results in costly attention patterns, as the requirements in terms of both memory and compute grow quadratically. To alleviate this problem, we propose a drop-in replacement for MHA called the Polynomial Mixer (PoM) that has the benefit of encoding the entire sequence into an explicit state. PoM has linear complexity with respect to the number of tokens. This explicit state also allows us to generate frames in a sequential fashion, minimizing memory and compute requirements, while still being able to train in parallel. We show that the Polynomial Mixer is a universal sequence-to-sequence approximator, just like regular MHA. We adapt several Diffusion Transformers (DiT) for generating images and videos with PoM…
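The idea of summarizing a sequence into an explicit state can be illustrated with a minimal sketch. This is not the paper's formulation: the element-wise polynomial features, the averaging, the sigmoid gate, and all function names below are assumptions chosen for illustration, and learned projections are omitted. The sketch shows only that such a state costs one pass over the tokens and can be built either in parallel or one token at a time with the same result.

```python
import math


def poly_features(x, degree=2):
    """Element-wise powers x, x^2, ..., x^degree, concatenated."""
    return [v ** p for p in range(1, degree + 1) for v in x]


def summarize(tokens, degree=2):
    """Parallel form: average the polynomial features of all tokens."""
    n = len(tokens)
    feats = [poly_features(t, degree) for t in tokens]
    return [sum(col) / n for col in zip(*feats)]


def summarize_sequential(tokens, degree=2):
    """Sequential form: update a running sum one token at a time."""
    total, n = None, 0
    for t in tokens:
        f = poly_features(t, degree)
        total = f if total is None else [a + b for a, b in zip(total, f)]
        n += 1
    return [a / n for a in total]


def mix(tokens, degree=2):
    """Each token reads the shared state through its own sigmoid gate."""
    state = summarize(tokens, degree)  # one pass over the sequence
    out = []
    for t in tokens:
        # Repeat the gate so it lines up with the concatenated powers.
        gate = [1 / (1 + math.exp(-v)) for v in t] * degree
        out.append([g * s for g, s in zip(gate, state)])
    return out


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, -1.0], [0.5, 0.0]]
    print(summarize(tokens))
    print(summarize_sequential(tokens))
    print(mix(tokens))
```

Because the state is a fixed-size summary, its size does not grow with the number of tokens, and a new frame's tokens can be folded into the running sum without revisiting earlier ones.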
Taxonomy
Topics: Advanced Vision and Imaging · Image and Signal Denoising Methods · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Diffusion
