Scaling Reasoning without Attention
Xueliang Zhao, Wei Wu, Lingpeng Kong

TL;DR
This paper introduces PromptCoT-Mamba, an attention-free language model built on state space dual (SSD) layers. It runs inference with fixed memory and, through a two-phase curriculum fine-tuning approach, outperforms larger models on reasoning benchmarks.
Contribution
The paper presents a new attention-free language model based on SSD layers, together with a curriculum fine-tuning strategy for complex reasoning tasks, and reports that the model outperforms comparable Transformer models on reasoning benchmarks.
Findings
PromptCoT-Mamba-7B surpasses comparable Transformer models on reasoning benchmarks.
PromptCoT-Mamba-7B outperforms the larger Gemma3-27B model on AIME and Livecodebench.
The model achieves fixed-memory, constant-time inference without self-attention or key-value caching.
Abstract
Large language models (LLMs) have made significant advances in complex reasoning tasks, yet they remain bottlenecked by two core challenges: architectural inefficiency due to reliance on Transformers, and a lack of structured fine-tuning for high-difficulty domains. We introduce PromptCoT-Mamba, an attention-free language model that addresses both issues through architectural and data-centric innovations. Built on the state space dual (SSD) layers of Mamba-2, our model eliminates the need for self-attention and key-value caching, enabling fixed-memory, constant-time inference. To train it for complex reasoning, we propose a two-phase curriculum fine-tuning strategy based on the PromptCoT synthesis paradigm, which generates pedagogically structured problems via abstract concept selection and rationale-guided generation. On benchmark evaluations, PromptCoT-Mamba-7B outperforms strong…
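The fixed-memory claim in the abstract can be illustrated with a toy recurrence. The sketch below is not the authors' implementation and not the Mamba-2 kernel; it assumes a single input channel, a scalar decay per step, and hypothetical names (`ssm_step`, `run_recurrent`, `run_quadratic`). It shows that a state space layer can produce each output from a state of constant size, while the equivalent attention-like form has to revisit every earlier position.

```python
import math

def ssm_step(h, x_t, a_t, B_t, C_t):
    """One recurrent step of a simplified scalar-decay state space layer.

    h is the running state (list of d_state floats); its size never grows.
    """
    # Decay the old state and write the new input into it.
    h = [a_t * h_i + b_i * x_t for h_i, b_i in zip(h, B_t)]
    # Read the output from the state.
    y_t = sum(c_i * h_i for c_i, h_i in zip(C_t, h))
    return h, y_t

def run_recurrent(xs, a, B, C):
    """Process a sequence step by step using only the fixed-size state."""
    h = [0.0] * len(B[0])
    ys = []
    for x_t, a_t, B_t, C_t in zip(xs, a, B, C):
        h, y_t = ssm_step(h, x_t, a_t, B_t, C_t)
        ys.append(y_t)
    return ys, h

def run_quadratic(xs, a, B, C):
    """Same outputs computed attention-style: step t looks back at every s <= t."""
    ys = []
    for t in range(len(xs)):
        y_t = 0.0
        for s in range(t + 1):
            decay = math.prod(a[s + 1 : t + 1])  # empty product is 1 when s == t
            score = sum(c * b for c, b in zip(C[t], B[s]))
            y_t += score * decay * xs[s]
        ys.append(y_t)
    return ys
```

In this toy setting the two functions agree numerically, but `run_recurrent` carries only `d_state` numbers between steps, whereas the look-back form needs all earlier inputs, which is the role a key-value cache plays in a Transformer.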
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Explainable Artificial Intelligence (XAI)
