TL;DR
FDM is a linear sequence model that keeps decode memory constant while improving associative recall. It does so by splitting sequence processing into a wave component and a particle component, and pairs the architecture with a two-phase training strategy (Freeze-Scan) and a holographic interpretation of decoding.
Contribution
The paper presents FDM, a linear sequence architecture with fixed O(1) decode memory, along with improved training and decoding methods. It reports lower decode memory and higher associative recall than a Transformer baseline.
Findings
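FDM's decode memory stays fixed at 867 MB, a 4.9x reduction relative to the Transformer's 4,247 MB at N=8,192 tokens.
Jointly training the wave and particle components leads to suboptimal convergence; the two-phase Freeze-Scan strategy is proposed to address this.
FDM achieves 0.966 accuracy on MQAR (multi-query associative recall), which the paper reports as well above the Transformer baseline.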
Abstract
We present FDM (Fan Duality Model), a linear sequence architecture that resolves the fundamental tension between memory efficiency and associative recall in sequence modeling. FDM separates sequence processing into two components: a wave component (recurrent scan via phase-preserving Givens rotations) that compresses long-range patterns into a fixed-size complex hidden state, and a particle component (local-global cache) that retrieves specific tokens via learned associative addressing with W+K=272 slots independent of sequence length N. This yields strictly O(1) decode memory: 867 MB fixed across all prompt lengths 128-8,192 tokens, versus Transformer's 853-4,247 MB (4.9x reduction at N=8,192). Beyond the architecture, we discover that jointly training the wave and particle components leads to suboptimal convergence. We propose Freeze-Scan, a two-phase training strategy that freezes…
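As a rough illustration of why a rotation-based scan gives length-independent decode state, the sketch below runs a toy recurrence over prompts of 128 and 8,192 tokens and then recomputes the memory ratios from the figures quoted in the abstract. It is not the paper's implementation: the function names (`givens_rotate`, `wave_scan`), the state width `D`, the additive update rule, and the random inputs are all assumptions made for illustration. Rotating a pair of reals by an angle is equivalent to multiplying one complex number by a unit phase, which is the sense in which the rotation preserves magnitude.

```python
import numpy as np


def givens_rotate(state, theta):
    """Rotate each consecutive pair (state[2i], state[2i+1]) by theta[i].

    Each 2x2 Givens rotation is orthogonal, so the vector norm is unchanged.
    """
    x, y = state[0::2], state[1::2]
    c, s = np.cos(theta), np.sin(theta)
    out = np.empty_like(state)
    out[0::2] = c * x - s * y
    out[1::2] = s * x + c * y
    return out


def wave_scan(inputs, thetas):
    """Toy recurrence h_t = G(theta_t) h_{t-1} + x_t (assumed update rule).

    Only the fixed-size vector h is carried between steps, so the state
    kept at decode time does not depend on the number of tokens seen.
    """
    h = np.zeros(inputs.shape[1])
    for x_t, theta_t in zip(inputs, thetas):
        h = givens_rotate(h, theta_t) + x_t
    return h


rng = np.random.default_rng(0)
D = 8  # hypothetical state width; must be even so entries pair up

for n_tokens in (128, 8192):
    inputs = rng.normal(size=(n_tokens, D))
    thetas = rng.uniform(0.0, 2.0 * np.pi, size=(n_tokens, D // 2))
    h = wave_scan(inputs, thetas)
    print(n_tokens, h.shape)  # state shape is the same for both lengths

# Memory figures as quoted in the abstract (MB).
FDM_MB = 867
TRANSFORMER_MB = {128: 853, 8192: 4247}
for n_tokens, mb in TRANSFORMER_MB.items():
    print(n_tokens, round(mb / FDM_MB, 2))
```

By the quoted figures, the fixed 867 MB is slightly above the Transformer's 853 MB at 128 tokens, so the saving appears only as the prompt grows and reaches about 4.9x at 8,192 tokens.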
