Continuous-Time Attention: PDE-Guided Mechanisms for Long-Sequence Transformers

Yukun Zhang; Xueqing Zhou

arXiv:2505.20666 · cs.LG · December 30, 2025

Open Access

TL;DR

This paper introduces Continuous-Time Attention, a PDE-guided extension of the Transformer attention mechanism. Attention weights evolve dynamically and smoothly, which improves dependency capture and training stability on long sequences.

Contribution

It presents a novel PDE-based framework for attention in Transformers in which attention weights evolve over pseudo-time, improving long-range dependency modeling and training stability.

Findings

01

Consistent performance improvements over standard Transformers on long sequence tasks.

02

Theoretical analysis shows PDE-based attention offers better optimization landscapes.

03

Enhanced ability to model global coherence in long sequences.

Abstract

We propose a novel framework, Continuous-Time Attention, which infuses partial differential equations (PDEs) into the Transformer's attention mechanism to address the challenges of extremely long input sequences. Instead of relying solely on a static attention matrix, we allow attention weights to evolve over a pseudo-time dimension via diffusion, wave, or reaction-diffusion dynamics. This mechanism systematically smooths local noise, enhances long-range dependencies, and stabilizes gradient flow. Theoretically, our analysis shows that PDE-based attention leads to better optimization landscapes and polynomial rather than exponential decay of distant interactions. Empirically, we benchmark our method on diverse experiments, demonstrating consistent gains over both standard and specialized long-sequence Transformer variants. Our findings highlight the potential of PDE-based formulations to…
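To make the idea of attention weights evolving over pseudo-time concrete, the following is a minimal sketch of the diffusion case only. It is not taken from the paper: the function name `diffuse_attention`, the parameters `alpha`, `dt`, and `steps`, the explicit Euler discretization, the choice to diffuse along the key axis, and the reflecting (Neumann) boundary are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def diffuse_attention(attn, alpha=0.1, dt=1.0, steps=4):
    """Evolve attention rows with explicit Euler steps of dA/dt = alpha * d2A/dk2.

    Hypothetical helper (not the paper's implementation). `attn` has shape
    (num_queries, num_keys); diffusion acts along the key axis.
    """
    # Explicit Euler on the 1D heat equation is stable for alpha * dt <= 0.5.
    if alpha * dt > 0.5:
        raise ValueError("alpha * dt must be <= 0.5 for a stable explicit step")
    a = np.array(attn, dtype=float)
    for _ in range(steps):
        # Edge padding gives a reflecting boundary, so each row's mass is conserved.
        padded = np.pad(a, ((0, 0), (1, 1)), mode="edge")
        laplacian = padded[:, :-2] - 2.0 * a + padded[:, 2:]
        a = a + alpha * dt * laplacian
    return a


# Usage: smooth a standard scaled dot-product attention matrix.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=(5, d))
k = rng.normal(size=(16, d))
attn = softmax(q @ k.T / np.sqrt(d))
smoothed = diffuse_attention(attn, alpha=0.1, dt=1.0, steps=4)
```

Under these assumptions each step replaces a weight with a convex combination of itself and its two neighbours, so rows stay non-negative and keep summing to one while sharp peaks are flattened. The wave and reaction-diffusion variants mentioned in the abstract would need different update rules and are not sketched here.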

Taxonomy

Topics: Machine Learning in Materials Science

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Exponential Decay