FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
Waleed Razzaq, Yun-Bo Zhao

TL;DR
FLUID introduces a continuous-time Transformer built on a Liquid Attention Network, which integrates continuous dynamics into the attention computation to improve modeling of irregular data, long-range dependencies, and physical dynamics.
Contribution
It proposes FLUID, a novel continuous-time (CT) Transformer that embeds continuous dynamics into attention, with stability guarantees and superior empirical performance across diverse tasks.
Findings
FLUID outperforms CT baselines by up to 47% in certain tasks.
It demonstrates robustness to noise and better generalization under distributional shifts.
FLUID achieves a balance between runtime and memory efficiency among competing models.
Abstract
Continuous-time (CT) Transformers improve irregular and long-range modeling over CT-RNNs by equipping input or output embeddings with continuous dynamics. However, the core scaled dot-product attention (SDPA) mechanism remains inherently discrete. We propose FLUID (Flexible Unified Information Dynamics), a CT Transformer that incorporates continuous dynamics directly into the attention computation by replacing SDPA with a Liquid Attention Network (LAN). LAN reinterprets attention logits as a continuous dynamical system and reformulates them as the solution to a linear ODE modulated by input-dependent nonlinear recurrent gates. Theoretically, we establish stability guarantees for LAN dynamics and show that it serves as an interpolating middle ground between SDPA and CT-RNNs, recovering each as a special case under a well-defined parameterization of its gating functions. LAN also introduces an…
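To make the idea of "attention logits as the solution to a linear ODE" concrete, here is a minimal NumPy sketch. It is not the paper's LAN formulation, which this summary does not give. It assumes a toy ODE da/dt = -rate * (a - s) with a(0) = 0, where s is the usual scaled dot-product score and rate = softplus(gate_logits) stands in for an input-dependent gate. The names `ode_logits`, `ode_attention`, and `gate_logits` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def ode_logits(q, k, t, gate_logits=None):
    """Closed-form solution at time t of the ASSUMED toy ODE
        da/dt = -rate * (a - s),  a(0) = 0,
    where s = q k^T / sqrt(d) and rate = softplus(gate_logits) > 0.
    This is an illustration only, not the paper's LAN equations."""
    d = q.shape[-1]
    s = q @ k.T / np.sqrt(d)              # standard SDPA scores (the ODE's fixed point)
    if gate_logits is None:
        gate_logits = np.zeros_like(s)    # hypothetical gate pre-activations
    rate = np.logaddexp(0.0, gate_logits) # softplus keeps the decay rate positive
    return s * (1.0 - np.exp(-rate * t))  # a(t) = s + (a(0) - s) * exp(-rate * t)

def ode_attention(q, k, v, t, gate_logits=None):
    # Attention weights are the softmax of the time-evolved logits.
    return softmax(ode_logits(q, k, t, gate_logits)) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 4))
k = rng.normal(size=(5, 4))
v = rng.normal(size=(5, 2))
early = ode_attention(q, k, v, t=0.0)    # zero logits: uniform average of v
late = ode_attention(q, k, v, t=100.0)   # logits have relaxed to the SDPA scores
```

In this toy setting, t = 0 gives uniform attention, and large t recovers ordinary SDPA because the logits settle at their fixed point s. This only loosely mirrors the abstract's claim that SDPA arises as a special case. The paper's actual gating functions and parameterization may differ.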
