TL;DR
RAT+ is a dense-pretraining architecture for attention models that lets a single model switch between dense and dilated sparse attention at inference, maintaining high accuracy at reduced computational cost.
Contribution
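A novel dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning, so a single model can adapt to various sparse attention configurations after a short adaptation rather than retraining a separate sparse model for each.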
Findings
RAT+ closely matches dense accuracy at dilation size D=16.
At D=64, RAT+ drops only 2-3 points in accuracy.
Scaling to larger models yields even better performance with significant FLOPs reduction.
Abstract
Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters…
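The factor-of-D reduction in attention FLOPs mentioned in the abstract can be illustrated with a small mask-counting sketch. It assumes one common form of causal dilated attention, in which a query at position i attends to key j only when j <= i and (i - j) is a multiple of D, optionally plus a short local window. The abstract does not spell out the exact pattern RAT+ uses, and the function names `dilated_causal_mask` and `attended_fraction` are hypothetical, so this is an illustration of the counting argument rather than the authors' implementation.

```python
import numpy as np


def dilated_causal_mask(seq_len, dilation, local_window=0):
    """Boolean mask; entry [i, j] is True when query i may attend to key j.

    Assumed pattern (not taken from the paper): causal, with keys kept
    when (i - j) is a multiple of `dilation`, plus the most recent
    `local_window` positions if a local window is requested.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    offset = i - j
    causal = offset >= 0
    dilated = (offset % dilation) == 0
    local = offset < local_window
    return causal & (dilated | local)


def attended_fraction(seq_len, dilation, local_window=0):
    """Share of dense causal query-key pairs that the sparse mask keeps."""
    dense = dilated_causal_mask(seq_len, 1).sum()
    sparse = dilated_causal_mask(seq_len, dilation, local_window).sum()
    return sparse / dense


if __name__ == "__main__":
    for d in (1, 16, 64):
        frac = attended_fraction(4096, d)
        print(f"D={d:3d}  kept fraction={frac:.4f}  1/D={1 / d:.4f}")
```

For long sequences the kept fraction approaches 1/D, which is the source of the FLOPs saving; adding a local window raises the count by roughly `local_window` pairs per query.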
