TL;DR
Nautile-370M is a 371-million-parameter language model that alternates linear-time spectral sequence operators with attention layers, targeting efficient reasoning under strict parameter and inference budgets.
Contribution
The paper introduces Nautile-370M, a hybrid architecture in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator, alternate with one transformer layer. It also proves expressivity results for the SCA readout mechanism.
Findings
The SCA readout mechanism can exactly retrieve any individual token from the prefix summary.
The SCA readout can reproduce any output of softmax attention.
Nautile-370M is designed for efficient reasoning at 371 million parameters, under strict parameter and inference budgets.
Abstract
We present Nautile-370M, a 371-million-parameter small language model designed for efficient reasoning under strict parameter and inference budgets. Nautile-370M uses a hybrid backbone in which two SeqCond Attention (SCA) layers, a linear-time spectral sequence operator inspired by SeqCondenser, alternate with one transformer layer. This design aims to retain the long-context efficiency and state-tracking benefits of structured sequential models while preserving the expressive token-to-token routing of attention. The model was trained on a single Cloud TPU v4-64 pod slice provided through the Google TPU Research Cloud (TRC) program; the subsequent reinforcement learning stage was carried out on a single NVIDIA DGX Spark. We prove that the SCA readout mechanism can exactly retrieve any individual token from the prefix summary and can reproduce any output of softmax attention as a special…
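The alternation described in the abstract (two SCA layers, then one transformer layer) can be pictured as a repeating layer schedule. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `hybrid_layer_pattern`, the layer labels, and the number of repeated groups are hypothetical, and the abstract does not state the total depth or whether the stack begins with SCA layers.

```python
from typing import List


def hybrid_layer_pattern(
    num_groups: int,
    sca_per_group: int = 2,   # from the abstract: two SCA layers...
    attn_per_group: int = 1,  # ...alternate with one transformer layer
) -> List[str]:
    """Return a list of layer-type labels for a hybrid backbone.

    Assumption: each group starts with the SCA layers and ends with the
    transformer layer. The real ordering and depth may differ.
    """
    if num_groups < 0:
        raise ValueError("num_groups must be non-negative")
    pattern: List[str] = []
    for _ in range(num_groups):
        pattern.extend(["sca"] * sca_per_group)
        pattern.extend(["transformer"] * attn_per_group)
    return pattern


# Hypothetical depth: 4 groups, chosen only for illustration.
layout = hybrid_layer_pattern(num_groups=4)
sca_fraction = layout.count("sca") / len(layout)
print(layout)
print(f"fraction of SCA layers: {sca_fraction:.2f}")
```

Under this 2:1 schedule, two thirds of the layers are linear-time operators regardless of depth, which is one way to read the efficiency claim: only every third layer pays the quadratic cost of full attention.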
