PDE-SSM: A Spectral State Space Approach to Spatial Mixing in Diffusion Transformers
Eshed Gal, Moshe Eliasof, Siddharth Rout, Eldad Haber

TL;DR
This paper introduces PDE-SSM, a spectral state-space method using PDEs to replace attention in vision transformers, achieving scalable, physically grounded spatial modeling with competitive performance.
Contribution
It presents PDE-SSM, a novel PDE-based operator for vision transformers that replaces attention, offering scalable, physics-inspired spatial modeling with improved efficiency.
Findings
PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers.
Replaces the quadratic cost of self-attention with near-linear complexity.
Provides a physically grounded, scalable alternative to attention mechanisms.
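To make the complexity claim concrete, the following back-of-the-envelope sketch compares all-to-all token interactions (N^2) with idealized FFT work (N log2 N), using the O(N log N) scaling cited in the reviews. The function names are hypothetical, and the counts ignore constants, channel width, and hardware effects, so they indicate asymptotic trends rather than measured speedups.

```python
import math


def attention_pairs(num_tokens: int) -> int:
    """Number of all-to-all token interactions: N^2."""
    return num_tokens * num_tokens


def fft_work(num_tokens: int) -> float:
    """Idealized FFT operation count: N * log2(N), constants dropped."""
    return num_tokens * math.log2(num_tokens)


def cost_ratio(num_tokens: int) -> float:
    """How many times larger N^2 is than N * log2(N)."""
    return attention_pairs(num_tokens) / fft_work(num_tokens)


if __name__ == "__main__":
    # Token grids of increasing side length (illustrative sizes only).
    for side in (16, 32, 64):
        n = side * side
        print(f"{side}x{side} tokens: N={n}, ratio={cost_ratio(n):.1f}")
```

For a 32x32 token grid (N = 1024), the idealized ratio is 1024 / 10 = 102.4. As the reviewers point out, such operation counts do not by themselves guarantee lower wall-clock training cost or latency on GPUs.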
Abstract
The success of vision transformers, especially for generative modeling, is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation. This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear complexity of O(N log N), delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain the PDE-based Diffusion Transformer PDE-SSM-DiT. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show…
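As a rough illustration of why a Fourier-domain solve gives global coupling at FFT cost, the sketch below advances a constant-coefficient convection-diffusion-reaction equation, u_t = kappa * Laplacian(u) - v . grad(u) + r * u, on a periodic grid. The function name spectral_cdr_step, the scalar coefficients kappa, velocity, and reaction, the periodic boundary, and the single-channel input are all assumptions made for illustration; the paper's learnable parameterization is not specified in this summary, so this is not the authors' implementation.

```python
import numpy as np


def spectral_cdr_step(u0, kappa, velocity, reaction, t):
    """Advance u_t = kappa*Lap(u) - v.grad(u) + reaction*u by time t.

    Assumes a periodic 2D grid with unit spacing and constant coefficients
    (illustrative assumptions, not taken from the paper).
    """
    h, w = u0.shape
    # Angular wavenumbers along each axis for unit grid spacing.
    ky = 2.0 * np.pi * np.fft.fftfreq(h)
    kx = 2.0 * np.pi * np.fft.fftfreq(w)
    KY, KX = np.meshgrid(ky, kx, indexing="ij")
    vy, vx = velocity
    # Fourier symbol of the operator: diffusion damps high frequencies,
    # convection rotates phases (a shift), reaction scales every mode.
    symbol = -kappa * (KX**2 + KY**2) - 1j * (vx * KX + vy * KY) + reaction
    # Each mode evolves independently: u_hat(t) = exp(symbol * t) * u_hat(0).
    u_hat = np.fft.fft2(u0)
    return np.fft.ifft2(np.exp(symbol * t) * u_hat).real


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    u0 = rng.standard_normal((8, 8))
    # Pure convection at unit speed along axis 1 for one time unit.
    shifted = spectral_cdr_step(u0, 0.0, (0.0, 1.0), 0.0, 1.0)
    print(np.allclose(shifted, np.roll(u0, 1, axis=1)))
```

With kappa = 0, reaction = 0, and velocity = (0, 1), one unit of time shifts the input by one pixel along the second axis, which gives a quick sanity check of the sign conventions. Because every output pixel depends on every input pixel through the two FFTs, the coupling is global even though the cost is dominated by the O(N log N) transforms.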
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1. The idea of using PDE-based state-space operators for spatial feature mixing is novel and elegant.
2. The paper provides a clear computational complexity analysis, showing how the Fourier-domain solver achieves O(N log N) scalability.
3. The writing is clear and technically mature.
4. The paper contributes a generalizable new building block for spatial deep learning, potentially applicable beyond diffusion transformers (e.g., segmentation, SR, video models).
1. While results are solid, most experiments are low- to mid-resolution (≤ 256×256). Testing PDE-SSM-DiT on high-resolution datasets (e.g., ImageNet256, LAION subsets) would better validate scalability and efficiency claims in realistic generative settings.
2. The paper mainly evaluates image generation. Given the generality of PDE-SSM, additional tasks (e.g., classification, segmentation, or video generation) could demonstrate broader utility.
3. The ablation of individual PDE terms (diffusio
1. **Novel 2D PDE formulation** PDE-SSM provides a principled generalization of 1D SSMs to 2D spatial domains by replacing the ODE with a diffusion–convection–reaction PDE. This formulation enables spatially coupled feature mixing that respects the grid topology of images rather than flattening them into 1D sequences, addressing the main structural limitation of prior Vision State Space Models.
2. **Reduced time complexity** Solving the PDE in the Fourier domain allows global token interactio
1. **Lack of quantitative evidence for spatial awareness** While the PDE formulation intuitively introduces spatial coupling, the experiments primarily report FID and runtime metrics. Including qualitative or quantitative analyses, such as spatial frequency responses or attention-map analogues, would strengthen the claim that PDE-SSM meaningfully captures spatial structure.
2. **Non-standard experiment settings** Generative evaluations are conducted mostly in pixel space, which is a non-standar
1. **Novel Generalization from ODE to PDE** Extending SSMs from ordinary differential equations (ODEs) to partial differential equations (PDEs) is an original and conceptually significant contribution.
2. **Well-Motivated Spatial Mixing Mechanism** The motivation for adopting PDEs as a means of modeling spatial interactions, rather than relying on ODE-based temporal dynamics, is clearly presented and well-justified.
3. **Clarity of Presentation** The paper is well-written and
1. **Higher Training Cost Despite Improved Scaling** Although PDE-SSM scales more favorably with sequence length, it still incurs a higher training cost compared to DiT. This undermines its practical gains in computational efficiency.
2. **Lack of Scaling Experiments** DiTs are known for their excellent scalability with model size. To establish PDE-SSM as a viable alternative, experiments demonstrating similar or superior scaling behavior are essential. Specifically, results showing
- The plug-and-play integration into DiT retains the original training schedules, demonstrating how easily the proposed method can be adopted.
- The extension of state-space models from 1D sequences to spatial domains via PDEs is novel and comes with extensive theoretical discussion.
- Results demonstrate improvements over DiT on multiple datasets.
- I am concerned about whether the proposed method can run efficiently on modern GPUs. Although its complexity is lower than DiT's, it may not be faster in practice. Latency/FPS compared to DiT should be reported.
- The performance gains are obtained on small datasets, where the quadratic complexity does not matter much. I wonder whether the results are still comparable at 256x256 or 512x512 resolution.
- There are many other works that use state space models to address the quadratic complexity of attention
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks · Advanced Memory and Neural Computing
