Caracal: Causal Architecture via Spectral Mixing
Bingzheng Gan, Tianyi Zhang, Yusu Li, Jing Huang, Wei Shi, Yangkai Ding, Tao Yu

TL;DR
Caracal introduces a Fourier-based architecture for long-sequence modeling that is scalable, efficient, and portable, addressing key limitations of traditional attention mechanisms in large language models.
Contribution
It replaces attention with a spectral mixing module using FFT, enabling scalable, portable, and autoregressive long-sequence modeling without hardware-specific optimizations.
Findings
Caracal achieves performance competitive with Transformer and SSM baselines.
Its sequence mixing costs O(L log L) in the sequence length L, compared with the quadratic cost of attention.
The model is portable and easy to deploy using standard libraries.
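To give a rough sense of what the O(L log L) claim means in practice, the sketch below compares idealized operation counts for quadratic attention and FFT-style mixing. The function names are hypothetical, and the counts ignore constant factors, the hidden dimension, and memory traffic, so the ratios indicate asymptotic growth only, not measured speedups for Caracal.

```python
import math

def attention_cost(L):
    # Idealized count of pairwise token interactions: grows as L^2.
    return L * L

def fft_mixing_cost(L):
    # Idealized count for FFT-based mixing: grows as L * log2(L).
    return L * math.log2(L)

for L in (1024, 8192, 65536):
    ratio = attention_cost(L) / fft_mixing_cost(L)
    print(f"L={L:>6}: L^2 / (L log2 L) = {ratio:.1f}")
```

The ratio reduces to L / log2(L), so the gap widens as sequences get longer.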
Abstract
The scalability of Large Language Models to long sequences is hindered by the quadratic cost of attention and the limitations of positional encodings. To address these, we introduce Caracal, a novel architecture that replaces attention with a parameter-efficient, O(L log(L)) Multi-Head Fourier (MHF) module. Our contributions are threefold: (1) We leverage the Fast Fourier Transform (FFT) for sequence mixing, inherently addressing both bottlenecks mentioned above. (2) We apply a frequency-domain causal masking technique that enforces autoregressive capabilities via asymmetric padding and truncation, overcoming a critical barrier for Fourier-based generative models. (3) Unlike efficient models relying on hardware-specific implementations (e.g., Mamba), we use standard library operators. This ensures robust portability, eliminating common deployment barriers. Evaluations demonstrate that…
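The abstract's idea of enforcing causality through padding and truncation can be illustrated with a generic FFT-based causal convolution. The sketch below is not the authors' MHF module: the function names, the single-channel real-valued setup, and the filter `h` are all assumptions made for illustration. It zero-pads the input and filter on the right to at least twice the sequence length, multiplies in the frequency domain, and keeps only the first L outputs, so that output position t depends only on inputs at positions 0..t.

```python
import cmath

def fft(x, inverse=False):
    # Recursive radix-2 FFT; len(x) must be a power of two.
    # The inverse transform is left unnormalized (caller divides by n).
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def causal_fft_mix(x, h):
    # Hypothetical helper: causal mixing of sequence x with filter h,
    # assuming len(h) == len(x).
    L = len(x)
    n = 1
    while n < 2 * L:  # pad so circular convolution cannot wrap around
        n *= 2
    X = fft(list(x) + [0.0] * (n - L))        # right-pad the input
    H = fft(list(h) + [0.0] * (n - len(h)))   # right-pad the filter
    y = fft([a * b for a, b in zip(X, H)], inverse=True)
    return [v.real / n for v in y[:L]]        # truncate to the first L outputs

def causal_direct(x, h):
    # O(L^2) reference: y[t] = sum over j <= t of h[j] * x[t - j].
    return [sum(h[j] * x[t - j] for j in range(t + 1)) for t in range(len(x))]

x = [1.0, 2.0, 3.0, 4.0]
h = [0.5, -1.0, 0.25, 2.0]
print(causal_fft_mix(x, h))
print(causal_direct(x, h))
```

Without the padding, the FFT product would compute a circular convolution in which late tokens leak into early outputs; padding to at least 2L and truncating removes that wrap-around. Both print statements should agree up to floating-point error.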
