RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

Xiuying Wei; Caglar Gulcehre

arXiv:2602.18196 · cs.LG · May 21, 2026

TL;DR

RAT+ is a dense pretraining approach for attention models that lets a single model switch between dense and dilated sparse attention at inference time. Dilation cuts attention FLOPs and KV cache size by a factor of the dilation size D while keeping accuracy close to the dense model.

Contribution

A novel dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning, allowing a single model to adapt to various sparse attention configurations without retraining separate sparse models.

Findings

01. RAT+ closely matches dense accuracy at D=16.

02. At D=64, RAT+ drops only 2-3 points in accuracy.

03. Scaling to larger models yields even better performance with significant FLOPs reduction.

Abstract

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters…
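To make the efficiency knob concrete, the following is a minimal sketch that counts query-key pairs under a causal dilated pattern. The exact pattern is an assumption for illustration (each query looks back at every D-th position, optionally plus a local window); the function names `visible_keys` and `count_pairs` are hypothetical and are not taken from the paper or its repository. The sketch only counts attention pairs as a rough proxy for attention FLOPs; it does not model the recurrence or the KV cache.

```python
def visible_keys(t, dilation, local_window=0):
    """Return the sorted key positions a causal query at position t may attend to.

    Assumed pattern (illustrative only): every `dilation`-th position counting
    back from t, optionally unioned with the last `local_window` positions.
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    # Strided lookback: t, t - D, t - 2D, ... down to position 0.
    strided = set(range(t, -1, -dilation))
    # Optional local window covering the most recent positions, including t.
    # With local_window=0 this range is empty.
    local = set(range(max(0, t - local_window + 1), t + 1))
    return sorted(strided | local)


def count_pairs(seq_len, dilation, local_window=0):
    """Count query-key pairs over a whole sequence (a proxy for attention FLOPs)."""
    return sum(
        len(visible_keys(t, dilation, local_window)) for t in range(seq_len)
    )


if __name__ == "__main__":
    n = 1024  # hypothetical sequence length
    dense = count_pairs(n, 1)  # dilation 1 is ordinary causal attention
    for d in (1, 16, 64):
        sparse = count_pairs(n, d)
        print(f"D={d:>2}: {sparse} pairs, {dense / sparse:.1f}x fewer than dense")
```

With dilation 1 the pattern reduces to ordinary causal attention, so the same counting function covers both the dense and the dilated case. For sequences much longer than D, the pair count shrinks by a factor approaching D, which is consistent with the reduction the abstract describes.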

Code & Models

Repositories

wimh966/rat-plus (GitHub)

Models

barpitf/ratplus (Hugging Face)
