The Condensate Theorem: Transformers are $O(n)$, Not $O(n^2)$
Jorge L. Ruiz Williams

TL;DR
This paper introduces the Condensate Theorem, showing that attention sparsity in transformers is a learned topological property, enabling lossless, linear-time attention computation with significant speedups and reduced inference costs.
Contribution
It demonstrates that attention can be computed losslessly in linear time by projecting onto a learned topological manifold, challenging the quadratic complexity assumption.
Findings
Attention mass concentrates on a topological manifold in trained models
Projection onto the Condensate Manifold achieves lossless $O(n)$ attention
Significant speedups in inference performance across multiple models
Abstract
We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold -- and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full attention. This is not an approximation -- it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected >1,200x speedup at 1M…
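The index set named in the abstract (Anchor + Window + Dynamic Top-k) can be illustrated with a minimal NumPy sketch for a single query. This is not the paper's kernel. The function names, the default sizes (`n_anchor`, `window`, `top_k`), and the brute-force top-k step are all assumptions made for illustration. In particular, this sketch scores every position to find the top-k, whereas the abstract says the manifold can be identified without checking every position. The sketch only shows how attention restricted to such an index set is computed and how its retained attention mass can be measured against full attention.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def full_attention(q, K, V):
    # Standard scaled dot-product attention for one query vector.
    scores = K @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V

def condensate_indices(scores, n_anchor, window, top_k):
    # Hypothetical index set: first tokens (anchor), most recent tokens
    # (window), and the highest-scoring positions (top-k).
    # ASSUMPTION: top-k is found by brute force over all scores here.
    n = scores.shape[0]
    anchor = np.arange(min(n_anchor, n))
    recent = np.arange(max(0, n - window), n)
    k = min(top_k, n)
    top = np.argsort(scores)[n - k:]
    return np.unique(np.concatenate([anchor, recent, top]))

def sparse_attention(q, K, V, n_anchor=4, window=64, top_k=32):
    # Attention restricted to the selected positions, renormalized
    # over that subset only.
    scores = K @ q / np.sqrt(q.shape[0])
    idx = condensate_indices(scores, n_anchor, window, top_k)
    return softmax(scores[idx]) @ V[idx], idx

def retained_mass(q, K, idx):
    # Fraction of the full softmax probability that falls on idx.
    scores = K @ q / np.sqrt(q.shape[0])
    return float(softmax(scores)[idx].sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 1024, 32
    K, V, q = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=d)
    out, idx = sparse_attention(q, K, V)
    err = np.max(np.abs(out - full_attention(q, K, V)))
    print(f"kept {len(idx)}/{n} positions, "
          f"retained mass {retained_mass(q, K, idx):.4f}, max abs error {err:.2e}")
```

On random, untrained inputs like those in the demo, the retained mass will generally be well below 1 and the outputs will differ. The abstract's claim is about trained models, where attention mass is reported to concentrate on this index set. Restricted attention matches full attention only to the extent that the omitted positions carry negligible softmax mass.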
Taxonomy
Topics: Big Data and Digital Economy · Generative Adversarial Networks and Image Synthesis · Advanced Graph Neural Networks
